Exploring an enriched Iowa liquor sales dataset

by Alex Spanos

Introduction

Welcome to my Exploratory Data Analysis project for Udacity’s Data Analyst Nanodegree. As I wanted my work to be somewhat more original, I forewent the option of analysing one of the datasets in the Udacity list and went on to spend a few weeks trying to locate an appropriate dataset online; appropriate in this context meant firstly satisfying Udacity prerequisites and secondly being a real-world recent dataset, relatively “virgin” with respect to analysis attempts.

There are a few excellent directories of free datasets out there, as well as free APIs. The one that intrigued me the most though was the Iowa Liquor Sales dataset. This dataset contains transactions between the relevant Iowa state authority and liquor stores across the state with a very high level of detail. I was surprised to find out that in Iowa alcohol sales are state-controlled; it seems that all orders have to go through the state! Anyway, the reasons why I found this dataset interesting were the following:

The final point was quite important for me, because what mostly intrigues me in this field is discovering relationships and patterns which are not directly obvious (well I guess that applies for most of us here).

So I then spent a bit of time trying to think of alternative data that I could merge into the master (raw) dataset. Given the richly detailed data at zipcode level, I managed to pull from the internet the following zipcode level information:

I did not know beforehand whether this would definitely provide interesting insights with regards to liquor consumption, but at least I could supplement my project with some descriptives about Iowa (which I knew next to nothing about!). Additionally, it was greatly enjoyable to exercise the skills I learnt in the Nanodegree’s Data Wrangling course. I got to use a few API’s, scraped some tables, used some regex and more in order to get all my auxiliary data sets in a neat csv format. Here I cheated a bit and used Python. After this data was in csv format though I performed all subsequent data merging operations in R.

For the analysis purposes of this project I used RStudio on my Windows 7 laptop and also used git and GitHub for version control.

So, on to the data analysis.

Data

Loading

# Load and pre-process liquor sales data ----------------------------------

#Using amazingly fast fread function to read in 800MB csv file
library(data.table)
data1 <- fread('./data/raw/Iowa_Liquor_Sales.csv', header = T, sep = ',',
               verbose=TRUE)
#Convert dates to Date objects
data1$DATE <- as.Date(data1$DATE, "%m/%d/%Y")

#Fix ZIPCODE issues
data_zipcodes <- unique(data1$ZIPCODE)
#Rogue zipcodes are 97 (sioux city, Morningside avenue) and NA (Dunlap).
##Dunlap is 61525 and Sioux city is 51106 (googled it)
summary(data_zipcodes)
min(na.omit(data_zipcodes))
max(na.omit(data_zipcodes))
sum(is.na(data1$ZIPCODE))
data1$CITY[is.na(data1$ZIPCODE)]
data1$ZIPCODE[is.na(data1$ZIPCODE)] <- 61525
data1$CITY[data1$ZIPCODE==97]
data1$ZIPCODE[data1$ZIPCODE==97] <- 51106

#Convert ZIPCODE to unordered factor (nominal variable)
data1$ZIPCODE <- factor(data1$ZIPCODE, ordered=F)

#Omit dollars in cost and price variables and convert to numeric;
#could have used gsub
data1$`STATE BTL COST` <- sapply(strsplit(data1$`STATE BTL COST`, 
                                          split='$', fixed=TRUE), 
                                 function(x) (x[2]))
data1$`STATE BTL COST` <- as.numeric(data1$`STATE BTL COST`)
data1$`BTL PRICE` <- sapply(strsplit(data1$`BTL PRICE`, split='$',
                                     fixed=TRUE), function(x) (x[2]))
data1$`BTL PRICE` <- as.numeric(data1$`BTL PRICE`)
data1$`TOTAL` <- sapply(strsplit(data1$`TOTAL`, split='$', fixed=TRUE),
                        function(x) (x[2]))
data1$`TOTAL` <- as.numeric(data1$`TOTAL`)

#The STORE LOCATION variable holds latitude and longitude values
#These will be useful in plotting on maps so let's extract them.
data1$STORE_LAT <- sapply(data1$`STORE LOCATION`, extract_lat_lon)[1,]
data1$STORE_LON <- sapply(data1$`STORE LOCATION`, extract_lat_lon)[2,]

#Dropping STORE LOCATION as it is redundant
select(data1, -`STORE LOCATION`)

#2015 data are not for the full year so I will just keep 2014
data1 <- subset(data1, DATE<"2015-01-01")
# Load auxilliary data in order to merge with main ---------


# Establishment data set --------------------------------------------------


#Load data
# I found this data set on https://www.census.gov/econ/geo-zip.html (i think)
# or here http://factfinder.census.gov/
# faces/affhelp/jsf/pages/metadata.xhtml?lang=en&type=dataset&id=dataset.en.BP_2011
est_data <- fread('./data/raw/iowa_food_drink.csv', header = T, sep = ',', verbose=TRUE)

#Pre-process data

library(tidyr)
library(dplyr)
#Drop unnecessary columns. Converted to data frame.
est_data <- select(as.data.frame(est_data), -GEOGRAPHY, -NAICS2007)
#Remove duplicates
est_data <- est_data[!duplicated(est_data2),]
#Transform dataset
est_data <- spread(est_data, NAICS2007_MEANING, ESTAB)
#Set missing to zero
est_data[is.na(est_data)] <- 0
#Make it a data table again
est_data <- as.data.table(est_data)


# Population data set -----------------------------------------------------


# Load data
# I don't remember where I got this data from
pop_data <- fread('./data/raw/Iowa_population.csv', header = T, sep = ',', verbose=TRUE)


# Weather data set --------------------------------------------------------

# Ordered daily data for 2014 from
#http://www.ncdc.noaa.gov/cdo-web/datatools/records
# Unfortunately data was very incomplete.

## Load data
weather_data <- fread('./data/raw/Iowa_weather.csv', header = T, 
                      sep = ',', verbose=TRUE)

## Pre-condition data
weather_data <- select(weather_data, -STATION, -TSUN, -AWND, -WT09, -WT01,
                       -WT06, -WT05, -WT02, -WT11, -WT04, -WT08, -WT03)
weather_data$ELEVATION <- as.numeric(weather_data$ELEVATION)
weather_data$LATITUDE <- as.numeric(weather_data$LATITUDE)
weather_data$LONGITUDE <- as.numeric(weather_data$LONGITUDE)
weather_data$DATE <- as.Date(as.character(weather_data$DATE), "%Y%m%d")
#Celsius convert. range of T is -200 to 200; something wrong
#weather_data$TMIN <- (weather_data$TMIN-32)*5/9

# Measurements do not include zipcode, so we need to associate zipcode to
#latitude and longitude. For this purpose, I pulled the relevant data from
#http://www.geonames.org/export/web-services.html#findNearbyPostalCodes
#by using their really nice API in Python (query input lat/lon and output
#is zipcode. I then made a .csv)

# Insert associated zipcode - lat/lon data
zip_lat_lon_data <- fread('./data/processed/ZIPCODES_LATLON.csv',
                          header = T, sep = ',', verbose=TRUE)
# Merge data sets
weather_data <- left_join(weather_data, zip_lat_lon_data,
                          by=c("LATITUDE", "LONGITUDE"))
weather_data <- select(weather_data, -V1)
#NAs in ZIPCODE. Possibly latitude and longitude not exactly same.
#Trying to populate based on min distances.
#Get NA subset to work with
nas.lon <- weather_data$LONGITUDE[is.na(weather_data$ZIPCODE)]
nas.lat <- weather_data$LATITUDE[is.na(weather_data$ZIPCODE)]
nas <- data.frame(nas.lon, nas.lat)
nas <- nas[!duplicated(nas),]
nas <- na.omit(nas)
ref.LLZ <- as.data.frame(select(zip_lat_lon_data, LONGITUDE,
                                LATITUDE, ZIPCODE))
library("geosphere")
# Code to add zipcode in intermediate data.frame that had NA's.
# To be merged to main data1.
for (i.na in seq(1,nrow(nas))) {
  b <- c()
  for (j.ref in seq(1,nrow(ref.LLZ))) {
    a <- distGeo(nas[i.na,], ref.LLZ[j.ref, 1:2])
    b <- append(a, b)
  }
  nas[i.na,3] <- ref.LLZ[which.min(b),3]
}
colnames(nas)[1:3] <- c("LONGITUDE", "LATITUDE", "ZIPCODE")

# Apply missing values fix
# Missing values condition
cond <- (is.na(weather_data$ZIPCODE))&(!is.na(weather_data$LATITUDE))
# Split data set into "good" and "bad"
weather_data.nas <- select(weather_data[cond], -ZIPCODE)
weather_data.nas <- left_join(weather_data.nas, nas, 
                              by=c("LATITUDE", "LONGITUDE"),copy=TRUE)
weather_data.good <- weather_data[!cond]
# Rejoin them after fixing "bad" subset
weather_data <- rbind(weather_data.nas, weather_data.good)
# Do some renaming so we don't have any conlicts with main during merge
setnames(weather_data, "LONGITUDE", "W_LONGITUDE")
setnames(weather_data, "LATITUDE", "W_LATITUDE")
setnames(weather_data, "STATION_NAME", "W_STATION_NAME")
# Decided to keep only these columns; rain, snow, min/max temp
weather_data <- select(weather_data,
                       c(DATE, ZIPCODE, PRCP, SNOW, TMIN, TMAX))
#Convert temps as there is something wrong! (to Celsius)
#weather_data$TMIN <- (weather_data$TMIN/10 - 32)*0.5556
#weather_data$TMAX <- (weather_data$TMAX/10 - 32)*0.5556


data1 <- fread('./data/processed/IA_unis.csv',
               header = T, sep = ',', verbose=TRUE)


# Universities data set ---------------------------------------------------

unis <- fread('./data/processed/IA_unis.csv', header = T,
              sep = ',', verbose=TRUE)

#No need to merge with master; just for plotting
###Here the main and auxilliary data sets are merged

# 1. Merge main and auxilliary establishment data set ---------------------

data1 <- left_join(data1, est_data, by = "ZIPCODE")
# NA's present because zipcode was not in est_data3
data1 <- na.omit(data1)
#Calculate total establishments
data1$TOTAL_EST <- rowSums(data1[seq(0,nrow(data1)), 23:45,with=FALSE])

# 2. Merge main and auxilliary population data set ------------------------

# Merge population data with base
data1 <- left_join(data1, pop_data, by = "ZIPCODE")
# Write unique ZIPCODE for Python use
#write.csv2(unique(data1$ZIPCODE), file="zipcodes.csv", sep=",")

# 3. Merge main and weather data set --------------------------------------
# Firstly identify common zipcodes in two data sets
inters_zipcodes <- intersect(unique(data1$ZIPCODE),
                             unique(weather_data5$ZIPCODE))

# Split main data set based on zipcode intersection
data1.paired <- data1[which(data1$ZIPCODE==inters_zipcodes)]
data1.unpaired <- data1[!which(data1$ZIPCODE==inters_zipcodes)]
# Initialise columns for rbind
data1.unpaired$PRCP <- NA
data1.unpaired$SNOW <- NA
data1.unpaired$TMIN <- NA
data1.unpaired$TMAX <- NA
# Remove duplicate zipcode-date entries. These are valid as they are from
# different stations
weather_data <- weather_data[!duplicated(weather_data,
                                         by=c("DATE","ZIPCODE"))]
data1.paired <- left_join(data1.paired, weather_data,
                          by=c("DATE", "ZIPCODE"))
#Re-join unpaired with enriched paired
data1 <- rbind(data1.paired, data1.unpaired)


# 4. Merge main and demographics data sets --------------------------------

# I made these data sets by scraping zipatlas tables in Python
# example URL:
# http://zipatlas.com/us/ia/zip-code-comparison/
# percentage-white-population.htm

# Nationalities data sets name parts
nationalities <- c('arab', 'asian','black','chinese','czech','danish',
                   'dutch','english','filipino', 'french','german','greek',
                   'hispanic','indian','irish','italian','japanese','korean',
                   'lithuanian', 'mexican','native','norwegian','polish',
                   'portuguese','russian','scottish','slovak','swedish',
                   'swiss', 'welsh','west','white')
# Which ones to keep in the end
sub_nationalities <- nationalities[c(1:4, 7,8, 10:17, 20,21,23, 25,26,32)]
# So, because of data set size, I decided to omit individual establishments
#data and only keep totals...
data1 <- data1[,c(1:22, 46:51)]
# Run left_join.nationalities program that does all pre-processing and
#joining to main. 
for (nation in sub_nationalities){
  left_join.nationalities(nation, data1)
}

Dataset features

Initial univariate exploration

# Variables in master data set are listed below:
colnames(data1)
##  [1] "ZIPCODE"         "DATE"            "STORE"          
##  [4] "NAME"            "ADDRESS"         "CITY"           
##  [7] "STORE LOCATION"  "COUNTY NUMBER"   "CATEGORY"       
## [10] "CATEGORY NAME"   "VENDOR NO"       "VENDOR"         
## [13] "DESCRIPTION"     "PACK"            "LITER SIZE"     
## [16] "STATE BTL COST"  "BTL PRICE"       "BOTTLE QTY"     
## [19] "TOTAL"           "TOTAL_EST"       "POPULATION"     
## [22] "PRCP"            "SNOW"            "TMIN"           
## [25] "TMAX"            "Perc_arab"       "Perc_asian"     
## [28] "Perc_black"      "Perc_chinese"    "Perc_dutch"     
## [31] "Perc_english"    "Perc_french"     "Perc_german"    
## [34] "Perc_greek"      "Perc_hispanic"   "Perc_indian"    
## [37] "Perc_irish"      "Perc_italian"    "Perc_japanese"  
## [40] "Perc_mexican"    "Perc_native"     "Perc_polish"    
## [43] "Perc_russian"    "Perc_scottish"   "Perc_white"     
## [46] "BOTTLE_QTY_NORM" "TOTAL_EST_NORM"  "STORE_LAT"      
## [49] "STORE_LON"

Columns from the original liquor sales data are 1-18; others come from the auxilliary sources.

# Example row in dataset
data1[1,]
##    ZIPCODE       DATE STORE                 NAME      ADDRESS  CITY
## 1:   50002 2014-12-08  4417 KUM & GO #76 / ADAIR 109 S 5TH ST ADAIR
##                                                         STORE LOCATION
## 1: 109 S 5TH ST\nADAIR 50002\n(41.49807185700007, -94.643468852999945)
##    COUNTY NUMBER CATEGORY     CATEGORY NAME VENDOR NO
## 1:             1  1012100 CANADIAN WHISKIES       115
##                              VENDOR  DESCRIPTION PACK LITER SIZE
## 1: Constellation Wine Company, Inc. Black Velvet   12        750
##    STATE BTL COST BTL PRICE BOTTLE QTY TOTAL TOTAL_EST POPULATION PRCP
## 1:           5.23      7.85         12  94.2        22       1297   NA
##    SNOW TMIN TMAX Perc_arab Perc_asian Perc_black Perc_chinese Perc_dutch
## 1:   NA   NA   NA         0       0.37          0            0       1.46
##    Perc_english Perc_french Perc_german Perc_greek Perc_hispanic
## 1:        11.05        1.38       41.13          0          1.04
##    Perc_indian Perc_irish Perc_italian Perc_japanese Perc_mexican
## 1:        0.22       9.75         0.16          0.07         0.89
##    Perc_native Perc_polish Perc_russian Perc_scottish Perc_white
## 1:        0.07         0.4            0          1.05       98.8
##    BOTTLE_QTY_NORM TOTAL_EST_NORM STORE_LAT STORE_LON
## 1:      0.00925212     0.01696222  41.49807 -94.64347

The dataset is quite rich and holds many features of potential interest. For the purposes of this project I opted to focus on the drinking patterns of Iowans with respect to time of year and locality.

As an initial attempt at producing descriptives for the dataset, I am defining below three functions that generate histograms and calculate means and number of observations. I have split the variables into three sets; the original liquor sales variables, the added meteorological variables and the added demographic variables. For the latter two categories, the statistics were obtained on data grouped by zipcode in order to be more correct.

# plot distributions instead of showing numbers
# Function for original liquor sales variables
# y is column number of master df, x is binwidth for hist
quick_hist <- function(y, x = 0) {
  npl <- y-13
  test =data1[, y, with=F]
  test <- na.omit(test)
  x = ifelse(x==0, (max(test)-min(test))/30, x)
  nameold <- colnames(test)[1]
  namevar <- gsub(" ", "_", nameold)
  setnames(test, nameold, namevar)
  meanc = as.numeric(colMeans(test[,1,with=F], na.rm=T))
  plt <- ggplot(data = test, aes_string(x=colnames(test)[1])) +
    geom_histogram(binwidth = x) +
    geom_vline(data=test,aes_string(xintercept=as.numeric(meanc)),
               linetype="dashed", colour="red", size=1) + 
    ggtitle(paste0(as.character(npl), ". ",  
                   "# Obs: ", as.character(nrow(na.omit(test))),", ", 
                   "Mean: ", as.character(round(meanc, digits=2)))) +
    theme(plot.title=element_text(face="bold", size = 8))
  rm(test)
  return(plt)
}

# For meteorological variables - group by DATE, ZIPCODE
quick_hist_meteo <- function(y, x = 0) {
  npl <- y-13
  test=data1[,c(1,2,y), with=F]
  test <- na.omit(test)
  x = ifelse(x==0, (max(test[,3,with=F])-min(test[,3,with=F]))/30, x)
  nameold <- colnames(test)[3]
  namevar <- gsub(" ", "_", colnames(test)[3])
  setnames(test, nameold, "temp")
  test <- ddply(test, .(DATE, ZIPCODE), summarise,
                temp=mean(temp, na.rm=T))
  setnames(test, "temp", namevar)
  meanc = mean(test[,3], na.rm=T)
  plt <- ggplot(data = test, aes_string(x=colnames(test)[3])) +
    geom_histogram(binwidth = x) +
    geom_vline(data=test,aes_string(xintercept=as.numeric(meanc)),
               linetype="dashed", colour="red", size=1) + 
    ggtitle(paste0(as.character(npl), ". ",  
                   "# Obs: ", as.character(nrow(na.omit(test))),", ", 
                   "Mean: ", as.character(round(meanc, digits=2)))) +
    theme(plot.title=element_text(face="bold", size = 8))
  rm(test)
  return(plt)
}

# For demographic variables - group by DATE, ZIPCODE
quick_hist_demos <- function(y, x = 0) {
  npl <- y-13
  test <- data1[,c(1,y), with=F]
  test <- data.table(na.omit(test))
  x = ifelse(x==0, (max(test[,2,with=F])-min(test[,2,with=F]))/30, x)
  nameold <- colnames(test)[2]
  namevar <- gsub(" ", "_", colnames(test)[2])
  setnames(test, nameold, "temp")
  test <- ddply(test, .(ZIPCODE), summarise,
                temp=mean(temp, na.rm=T))
  colnames(test)[2] <- namevar
  test <- data.table(test)
  meanc = mean(test[,2], na.rm=T)
  plt <- ggplot(data = test, aes_string(x=colnames(test)[2])) +
    geom_histogram(binwidth = x) +
    geom_vline(data=test,aes_string(xintercept=as.numeric(meanc)),
               linetype="dashed", colour="red", size=1) + 
    ggtitle(paste0(as.character(npl), ". ",  
                   "# Obs: ", as.character(nrow(na.omit(test))),", ", 
                   "Mean: ", as.character(round(meanc, digits=2)))) +
    theme(plot.title=element_text(face="bold", size = 8))
  rm(test)
  rm(meanc)
  return(plt)
}

For the following plots I used multiplot; a very useful function from cookbook-r.

# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)
  
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  
  numPlots = length(plots)
  
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  
  if (numPlots==1) {
    print(plots[[1]])
    
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

The liquor sales variables in this case are number of bottles per ordered pack, volume of bottle in millilitres, the cost of the bottle for the state, the price it was sold to the merchant for, the total number of bottles ordered and the total money spent per order for the merchant. Let’s look at their distributions.

# Plot liquor sales stuff
multiplot(quick_hist(14) + theme(plot.title = element_text(size=6),
                                 axis.title = element_text(size=6)),
          quick_hist(15) + theme(plot.title = element_text(size=6),
                                 axis.title = element_text(size=6)),
          quick_hist(16) + theme(plot.title = element_text(size=6),
                                 axis.title = element_text(size=6)),
          quick_hist(17) + theme(plot.title = element_text(size=6),
                                 axis.title = element_text(size=6)),
          quick_hist(18) + theme(plot.title = element_text(size=6),
                                 axis.title = element_text(size=6)),
          quick_hist(19) + theme(plot.title = element_text(size=6),
                                 axis.title = element_text(size=6)),
          cols=2)

It is not really clear what’s going on so I am going to log-transform the x-axis of the cost-related variables and zoom-in closer for “PACK”, “LITER SIZE” and “BOTTLE QTY”.

# Adjust them so we see more things
multiplot(quick_hist(14, 2) + coord_cartesian(xlim=c(0, 50)), 
          quick_hist(15, 50) + coord_cartesian(xlim=c(0, 2000)), 
          quick_hist(16,0.1) + scale_x_log10(), quick_hist(17, 0.1) + scale_x_log10(), 
          quick_hist(18,2) + coord_cartesian(c(0,100)),
          quick_hist(19, 0.1) + scale_x_log10(), cols=3)

So now we can see that distributions of the cost-related have transformed to bell-shaped or normal-like. We can also clearly see the discrete nature of the other variables.

Now, let’s see descriptives of the meteorological variables; again these were derived after grouping by zipcode-day pairs.

multiplot(quick_hist_meteo(22), 
          quick_hist_meteo(23),
          quick_hist_meteo(24),
          quick_hist_meteo(25),
          cols = 2)

For precipitation and snow it’s hard to discern what’s going on so I will log-transform x-axes; for the temperatures I will just make the bins finer. Here are the plots:

multiplot(quick_hist_meteo(22, 0.1) + scale_x_log10(), 
          quick_hist_meteo(23, 0.1) + scale_x_log10(),
          quick_hist_meteo(24, 2),
          quick_hist_meteo(25, 2),
          cols = 2)

I would like to note that in general I am a bit disappointed by the completeness of the meteorological data that I ordered from the National Climatic Data Center; I also don’t understand the units used for temperatures, as any transformations I tried did not return reasonable planet Earth values..

The above plots are a bit hard to interpret as they are, because zipcodes are not represented uniformly; e.g. zipcode x may have meteorological readings for all days whereas others just for 5 days. If one desires to perform some inference on Iowa weather, the simplest way would be to generate the same plots for individual zipcodes.

Now onto the final partition of variables, the demographics. These include ethnic groups/nationalities percentages and total drinking establishments - grouped by zipcode.

# Plot demographic variables
multiplot(quick_hist_demos(20),
          quick_hist_demos(26), quick_hist_demos(27), quick_hist_demos(28),
          quick_hist_demos(29), quick_hist_demos(30), quick_hist_demos(31),
          quick_hist_demos(32), quick_hist_demos(33), quick_hist_demos(34),
          quick_hist_demos(35), quick_hist_demos(36), quick_hist_demos(39),
          quick_hist_demos(40), quick_hist_demos(41), quick_hist_demos(42), 
          quick_hist_demos(43), quick_hist_demos(44), quick_hist_demos(45),
          cols = 5)

From plots 14, 15 and 32 I guess it is clear that Iowa’s population is predominantly white. Without knowing too much about US geodemographics, presumably this is a “heartland” characteristic?

Otherwise I am surprised to see the magnitude of the German population - almost 50% on average! It can’t be true that half the Iowa population is German people, so I guess these categories pertain to “heritage” in general. Which means almost half of the Iowans are of German descent.

Also common is English, Scottish, French and Dutch heritage. In these histograms, one can see that the black and Hispanic and white group distributions exhibit long-tails, implying that there exist specific areas where these populations are much more (or less) prevalent than generally encountered in Iowa.

The “outlier” plot category in this panel is the total number of drinking establishments per zipcode - included here for thoroughness purposes.

In addition to the above, I think it would be useful to show the most popular categories of liquors and individual types.

The dataset contains 69 different categories, so it will be hard to visualise all of them in a plot. In this case just showing a table is also informative.

gr_cat <- ddply(data1, .(`CATEGORY NAME`), summarise,
                           bottles=sum(`BOTTLE QTY`, na.rm=T), price=mean(`BTL PRICE`))
arrange(gr_cat, -bottles)
##                         CATEGORY NAME bottles      price
## 1                      80 PROOF VODKA 5034321   9.175411
## 2                   CANADIAN WHISKIES 3063617  13.687787
## 3                          SPICED RUM 1781396  14.380130
## 4                     WHISKEY LIQUEUR 1092220  16.440005
## 5                             TEQUILA 1080224  20.358524
## 6                    BLENDED WHISKIES 1069756  10.400273
## 7           STRAIGHT BOURBON WHISKIES 1067546  17.220917
## 8                      IMPORTED VODKA 1008599  20.268337
## 9    PUERTO RICO & VIRGIN ISLANDS RUM  992118  11.405754
## 10                     FLAVORED VODKA  966793  10.935803
## 11            AMERICAN GRAPE BRANDIES  689098   8.059325
## 12                 TENNESSEE WHISKIES  688586  20.747060
## 13                  AMERICAN DRY GINS  613832   9.085220
## 14                       FLAVORED RUM  538311  12.678766
## 15                 AMERICAN COCKTAILS  534658  10.689236
## 16 MISC. IMPORTED CORDIALS & LIQUEURS  487998  21.449140
## 17              IMPORTED VODKA - MISC  474696  16.909476
## 18                     CREAM LIQUEURS  431367  17.743147
## 19            IMPORTED GRAPE BRANDIES  372098  21.058585
## 20                  IMPORTED SCHNAPPS  349139  16.437759
## 21                    SCOTCH WHISKIES  301218  23.689421
## 22 MISC. AMERICAN CORDIALS & LIQUEURS  253863  11.564033
## 23        DISTILLED SPIRITS SPECIALTY  225048  15.317123
## 24                         TRIPLE SEC  224300   3.974883
## 25     DECANTERS & SPECIALTY PACKAGES  216893  24.330833
## 26                PEPPERMINT SCHNAPPS  213255   7.174323
## 27                  IMPORTED DRY GINS  212966  22.151287
## 28                     IRISH WHISKIES  208518  25.281346
## 29                     PEACH SCHNAPPS  151208   9.646169
## 30                    COFFEE LIQUEURS  136482  15.525798
## 31              STRAIGHT RYE WHISKIES  126953  26.339466
## 32                BLACKBERRY BRANDIES  121103   9.163894
## 33                  AMERICAN AMARETTO  119461   6.739355
## 34                 SINGLE MALT SCOTCH   87382  43.527749
## 35                     APPLE SCHNAPPS   65761  10.148421
## 36                   AMERICAN ALCOHOL   57573  13.408440
## 37              BUTTERSCOTCH SCHNAPPS   57166   9.235604
## 38                  CINNAMON SCHNAPPS   48640  11.526123
## 39                   APRICOT BRANDIES   47441   9.035857
## 40                WATERMELON SCHNAPPS   39578  10.514740
## 41             MISCELLANEOUS SCHNAPPS   36833  12.033186
## 42                     GRAPE SCHNAPPS   34779  10.654928
## 43                  IMPORTED AMARETTO   25090  20.526154
## 44                       BARBADOS RUM   24914  15.059416
## 45                 ROOT BEER SCHNAPPS   24296   9.322246
## 46                     PEACH BRANDIES   23588   7.696805
## 47                    CHERRY BRANDIES   21856   7.767895
## 48                    100 PROOF VODKA   19816  14.089263
## 49                        JAMAICA RUM   19799  16.038717
## 50                STRAWBERRY SCHNAPPS   19578   7.720291
## 51                 RASPBERRY SCHNAPPS   17918   8.544250
## 52                       FLAVORED GIN   17434  10.538608
## 53            TROPICAL FRUIT SCHNAPPS   14127   7.256225
## 54              GREEN CREME DE MENTHE   11782   6.934776
## 55     SINGLE BARREL BOURBON WHISKIES    9574  34.392213
## 56                 AMERICAN SLOE GINS    8731   6.828953
## 57               WHITE CREME DE CACAO    8543   6.967944
## 58                DARK CREME DE CACAO    8263   6.967815
## 59                    LOW PROOF VODKA    5690  12.502111
## 60             MISCELLANEOUS BRANDIES    4580  24.599805
## 61                 SPEARMINT SCHNAPPS    4313   7.203551
## 62                  OTHER PROOF VODKA    3876  11.962781
## 63                         ROCK & RYE    3611  10.966809
## 64            BOTTLED IN BOND BOURBON    3070  14.623684
## 65              WHITE CREME DE MENTHE    2839   6.999466
## 66                                       1755  33.695673
## 67                    CREME DE ALMOND    1710   7.001005
## 68                           ANISETTE    1491   6.927277
## 69                    HIGH PROOF BEER      38 109.174286

We see that 80 proof vodka is the most popular liquor type in Iowa by some margin. Canadian whiskies are also very popular. Also popular are rums, cheap whiskies and tequila. There also exists a category without a name which has a relatively high average price.

On the contrary, the least popular liquors are high proof beer (only 38 bottles sold - a categorisation outlier? A one-off order?) and liqueurs like Anisette (I imagine this is like ouzo), crème liqueurs, and spearmint schnapps (peppermint schnapps on the other hand is quite popular - both must taste like cough medicine!).

Let’s see the most and least popular individual liquors.

gr_item <- ddply(data1, .(DESCRIPTION), summarise,
                 bottles=sum(`BOTTLE QTY`, na.rm=T), price=mean(`BTL PRICE`))
#Most popular
arrange(gr_item, -bottles)[1:30,]
##                              DESCRIPTION bottles     price
## 1                           Black Velvet 1422405 10.191468
## 2                          Hawkeye Vodka 1036303  7.313998
## 3              Captain Morgan Spiced Rum  656577 17.603547
## 4              Fireball Cinnamon Whiskey  572895 15.620919
## 5          Jack Daniels Old #7 Black Lbl  473872 25.320187
## 6                           Five O'clock  466019  7.092719
## 7                           Barton Vodka  398614  8.211223
## 8                    Mccormick Vodka Pet  371391  6.719349
## 9                  Smirnoff Vodka 80 Prf  365132 11.796400
## 10          Absolut Swedish Vodka 80 Prf  320092 19.418173
## 11           Crown Royal Canadian Whisky  318719 25.130260
## 12                        Phillips Vodka  307152  9.240963
## 13                  Bacardi Superior Rum  301426 13.453927
## 14           Seagrams 7 Crown Bl Whiskey  300828 12.463819
## 15                  Jagermeister Liqueur  287670 17.999875
## 16                       Mccormick Vodka  278271  3.637401
## 17       Paul Masson Grande Amber Brandy  275110  5.361728
## 18                              Jim Beam  269741 16.211123
## 19             Admiral Nelson Spiced Rum  264521 12.053835
## 20                             Five Star  260876  7.090375
## 21 Jose Cuervo Especial Reposado Tequila  242899 16.597364
## 22                   Paramount White Rum  239648  9.024818
## 23                   Canadian Ltd Whisky  235012  9.412726
## 24                      Southern Comfort  229259 12.327208
## 25               Seagram's Extra Dry Gin  223595  8.402159
## 26        Captain Morgan Original Spiced  223252 13.245653
## 27                    Five O'clock Vodka  220975  7.155968
## 28                      Grey Goose Vodka  219106 24.536126
## 29                    Malibu Coconut Rum  215779 14.376951
## 30             Uv Blue (raspberry) Vodka  194191 11.437719

Surprisingly I’ve never noticed most of these spirits at bars in Europe. Altogether though this list was what I expected: vodkas, whiskeys and rums and there are a few global household names in there such as Captain Morgan, Jack Daniels, Smirnoff, Jagermeister, etc.. One thing is for sure though.. Iowans love vodka and particularly the cheaper local variety! I’m also a bit curious regarding the popularity of cinnamon whiskey. Maybe a success with the student population?

#Least popular
arrange(gr_item, bottles)[1:30,]
##                                     DESCRIPTION bottles   price
## 1                             Calvados Morin XO       1   70.50
## 2                 Framboise (Raspberry Liqueur)       1   45.89
## 3                              Glendronach 18yo       1  109.88
## 4                John J Bowman Str. Bbn.Why. HA       1   23.40
## 5                    Calvados Morin Hors d' Age       2   54.00
## 6              Cedar Ridge Barrel Proof Bourbon       2 8700.00
## 7                                Espolon Blanco       2   20.13
## 8                    Fernandaez Brandy De Jerez       2   54.96
## 9             Glendronach 14yr Virgin Oak - DNO       2   69.30
## 10                    Grappa di Recioto Amarone       2  117.36
## 11          Hirsch Straight Bourbon Small Batch       2   33.00
## 12                              Jameson 12YR HA       2   44.99
## 13                       Johnnie Walker Odyssey       2  747.96
## 14 Macallan Sherry Cask 25 Year Old Scotch - HA       2  845.10
## 15                    New Amsterdam Citron Mini       2    6.93
## 16                            Sam Adams Utopias       2  118.96
## 17   Seagram's Seven Crown American Spiced Mini       2    9.26
## 18                        Taaka King Cake Vodka       2    6.00
## 19                Tooters 50ml Marg, OTB, Mango       2   65.25
## 20                                  Toucano Rum       2   26.04
## 21                        Uncle Bob's Root Beer       2   15.38
## 22                  1800 Ultimate Blueberry RTD       3   14.25
## 23                            Glenmorangie 25yr       3  431.99
## 24               Hiram Walker Blue Curacao 3pak       3    5.85
## 25              Johnnie Walker King George V HA       3  444.98
## 26          Libertine Absinthe(french Absinthe)       3   64.13
## 27        Seagram's Seven Crown American Spiced       3   10.50
## 28        Admiral Nelson Cherry Spiced Rum Mini       4    5.66
## 29                            Ciroc Luxury Mini       4   44.93
## 30                    Irishman Founders Reserve       4   25.01

All items in the top 30 least popular list are in a sense outliers as at most 4 bottles have been ordered altogether. These probably correspond to single orders, and it is likely that some kind of data recording inconsistency has taken place (e.g. product description entered somehow differently).

Interestingly however, a few rather expensive liquors are included in this list - mainly bourbons and whiskeys. For example, two orders were placed for “Cedar Ridge Barrel Proof Bourbon” - with a bottle price of 8700 (!). It turns out that two separate orders were placed on 27/10 and 29/10 by two separate “Hy-Vee Food” stores in different zipcodes in Cedar Rapids. Interesting - anyone’s guess really as to the background story. It has to be a special order for a specific customer (maybe they broke the first bottle? An expensive mistake!).

Initial multivariate exploration

In this part I intended to show some exploratory scatter plots between variables in order to visually identify any stand-out associations, by making use of the great ggpairs function of GGally. There are a couple of issues though. For one, the number of observations is quite large (~2.5m) and also, the dataset contains 49 variables. There is no way ggpairs will produce a sensible display so we need to be a bit clever and restrict what we feed it.

As a first attempt, let’s look at associations between proportions of national heritage group proportions.

This one is fairly easy to display, as observations are unique per zipcode (thus restricting the number of observations to about 350 - the number of unique zipcodes) and I have included 20 relevant variables in the data. Let’s see what we get.

gr_zip_pop <- ddply(data1[,c(1,24, seq(29,68,2), 72, 73), with=F], .(ZIPCODE), 
                numcolwise(mean, na.rm=T))
gr_zip_pop$ZIPCODE <- factor(gr_zip$ZIPCODE, ordered=F)
ggpairs(gr_zip_pop, columns=3:22, axisLabels = 'internal',
        upper = "blank", lower=list(params=list(size=1)),
        params = c(Shape = I("."), outlier.shape = I(".")), 
        title="Scatter plots between national heritage group proportions") + 
  theme(legend.position = "none",
        panel.grid.major = element_blank(),
        axis.ticks = element_blank(),
        axis.text = element_blank())

The display is a bit compressed but we can still observe positive linear correlations between hispanic and mexican, english and scottish, english and irish, french and irish. We can also spot a few negative correlations such as german and english, german and french, black and white, asian and white. I think these displays aren’t very informative.

Next, let’s look at scatter plots between the main variables of interest in the original liquor sales data. I will perform the analysis on a reasonably sized subset of the data by restricting to a single zipcode (50314 - the highest populated zipcode in Des Moines).

pairs2 <- data1[data1$ZIPCODE==50314][,c(14, 15,16,17,18,19),with=F]
#ggpairs does not like spaces in variable names
setnames(pairs2, "STATE BTL COST", "STATE_BTL_COST")
setnames(pairs2, "LITER SIZE", "LITER_SIZE")
setnames(pairs2, "BTL PRICE", "BTL_PRICE")
setnames(pairs2, "BOTTLE QTY", "BOTTLE_QTY")
ggpairs(pairs2, axisLabels = 'internal',
        upper = "blank", lower=list(params=list(size=1)),
        params = c(Shape = I("."), outlier.shape = I("."))) + 
  theme(legend.position = "none",
        panel.grid.major = element_blank(),
        axis.ticks = element_blank(),
        axis.text = element_blank())

So the raw output here is not very informative and in order to see more one has to adjust axes, convert scales, etc. This will take place for variables of interest later on in this work. But a quick scan through the plots shows a very strong linear correlation between state buying and selling price of items; this implies a fixed profit (in terms of percentage) is set by the state. Also correlated, as expected, are bottle quantities sold per transaction and total money spent per transaction.

I spent some time trying some variable groupings with ddply in order to find some interesting combinations to input to ggpairs but was unable to produce any particularly interesting visualisations at this preliminary exploratory stage. I think that due to plot size restrictions, ggpairs is not the best tool to use for a dataset of this magnitude in terms of number of observations and variables. I also found it was quite hard to find a way to adjust font sizes in the various sections of the output - especially on the diagonal.

Analysis

Demographics

In the following plots we will perform a high-level analysis of basic demographics.

Firstly let’s look at population per zipcode, so we need to group accordingly (we’ll use the gr_zip_pop variable we defined earlier on).

#Prep for map plotting
#Get map of Iowa using RgoogleMaps and ggmap
CenterOfMap <- geocode("41.9137948,-93.3293731")
Iowa <- get_map(c(lon=CenterOfMap$lon, lat=CenterOfMap$lat), zoom = 7,
                   source = "stamen", maptype= "toner", color="bw")
IowaMap <- ggmap(Iowa)

So now we’ve grouped per zipcode and pulled the population.

I realised here that I had not actually merged the coordinates of “zipcode centre” to the master. I got around this by calculating the mean latitude and longitude of stores in the zipcode. Below lies a the map of Iowa where we overlay population bubbles per zipcode and also the location of universities; here the triangle size indicates number of students enrolled.

# Explore demographics ----------------------------------------------------
suppressWarnings(IowaMap + 
  geom_point(data=gr_zip_pop, aes(x=STORE_LON, y=STORE_LAT, size=POPULATION),
             colour="red", alpha=0.5) +
  geom_point(data = unis, aes(x=lon, y=lat, size=enrolled), colour="blue",
             shape=17, alpha=0.5) + 
  theme(legend.position="none", axis.title.x=element_blank(),
        axis.title.y=element_blank(), axis.ticks=element_blank(),
        axis.text.x=element_blank(), axis.text.y=element_blank(),
        plot.title = element_text(size = rel(1), colour = "blue")) +
  labs(title="Map of Iowa with population and university overlays"))

We can see that Davenport and Iowa city are big University cities. Regarding population, Des Moines, Waterloo, Cedar Rapids, Dubuque and Sioux City are all cities in which certain areas (zipcodes) are highly populated. It will be more useful of course to look at overall city population and that is what we are going to do now.

gr_city_pop <- ddply(data1, .(ZIPCODE, CITY), summarise, 
                     POPULATION = mean(POPULATION, na.rm=T), 
                     STORE_LAT = mean(STORE_LAT, na.rm = T),
                     STORE_LON = mean(STORE_LON, na.rm = T))
gr_city_pop <- ddply(gr_city_pop, .(CITY), summarise, 
                     TOT_POP = sum(POPULATION, na.rm = T),
                     LAT = mean(STORE_LAT, na.rm = T), 
                     LON = mean(STORE_LON, na.rm =T))
IowaMap + 
  geom_point(data=gr_city_pop, aes(x=LON, y=LAT, size=TOT_POP),
             colour="purple", alpha=1) +
  theme(legend.position="none", axis.title.x=element_blank(),
        axis.title.y=element_blank(), axis.ticks=element_blank(),
        axis.text.x=element_blank(), axis.text.y=element_blank(),
        plot.title = element_text(size = rel(1), colour = "blue")) +
  labs(title="Map of Iowa with city population overlay")

And county populations below..

gr_county_pop <- ddply(data1, .(ZIPCODE, COUNTY), summarise, 
                     POPULATION = mean(POPULATION, na.rm=T), 
                     STORE_LAT = mean(STORE_LAT, na.rm = T),
                     STORE_LON = mean(STORE_LON, na.rm = T))
gr_county_pop <- ddply(gr_county_pop, .(COUNTY), summarise, 
                     TOT_POP = sum(POPULATION, na.rm = T),
                     LAT = mean(STORE_LAT, na.rm = T), 
                     LON = mean(STORE_LON, na.rm =T))
IowaMap + 
  geom_point(data=gr_county_pop, aes(x=LON, y=LAT, size=TOT_POP),
             colour="blue", alpha=1) +
  theme(legend.position="none", axis.title.x=element_blank(),
        axis.title.y=element_blank(), axis.ticks=element_blank(),
        axis.text.x=element_blank(), axis.text.y=element_blank(),
        plot.title = element_text(size = rel(1), colour = "blue")) +
  labs(title="Map of Iowa with county population overlay")

I think it’s clear that the centre of Iowa is Des Moines. There are some highly populated areas along the county’s contour, such as Sioux City, Omaha and Dubuque.

Before moving on to the liquor sales analysis I would like to take a look at national group populations.

#My save-space function for plotting
plot_map_nat_over <- function(y){
  x <- colnames(gr_zip_pop)[y]
  IowaMap + 
    geom_point(data=gr_zip_pop, aes_string(x="STORE_LON", y="STORE_LAT",
                                           colour=x, alpha=x)) +
    scale_colour_gradientn(colours = topo.colors(10), limits=c(0,50)) +
    theme(legend.position="none", axis.title.x=element_blank(),
          axis.title.y=element_blank(), axis.ticks=element_blank(),
          axis.text.x=element_blank(), axis.text.y=element_blank(),
          plot.title = element_text(size = rel(1), colour = "blue")) +
    labs(title=x)
}

In the composite plot below I am producing something like a heat map for population percentages per zipcode for a selection of nationalities; “cold” is blue and “hot” is yellow/brown (I found it hard to overlay a global legend.)

#Produce maps with overlays for all nationalities

suppressWarnings(multiplot(plot_map_nat_over(3), plot_map_nat_over(6), plot_map_nat_over(7), 
          plot_map_nat_over(8), plot_map_nat_over(9), plot_map_nat_over(10),
          plot_map_nat_over(11), plot_map_nat_over(12), plot_map_nat_over(13),
          plot_map_nat_over(14), plot_map_nat_over(15), plot_map_nat_over(16),
          plot_map_nat_over(17), plot_map_nat_over(18), plot_map_nat_over(19),
          plot_map_nat_over(20), plot_map_nat_over(21),
          cols=5))

And finally

#Add establishments variable to zipcode grouped data
test <- unique(cbind(data1$ZIPCODE, data1$TOTAL_EST))
test <- data.table(test)
setnames(test, c(1,2), c("ZIPCODE", "TOTAL_EST"))
test$ZIPCODE <- factor(test$ZIPCODE)
gr_zip_pop <- left_join(gr_zip_pop, test)
rm(test)
#Plot
suppressWarnings(IowaMap + 
  geom_point(data=gr_zip_pop, aes(x=STORE_LON, y=STORE_LAT, size=TOTAL_EST),
             colour="green", alpha=0.5) +
  theme(legend.position="none", axis.title.x=element_blank(),
        axis.title.y=element_blank(), axis.ticks=element_blank(),
        axis.text.x=element_blank(), axis.text.y=element_blank(),
        plot.title = element_text(size = rel(1), colour = "blue")) +
  labs(title="Map of Iowa with establishment overlays"))

OK now let’s move on to liquor sales.

Liquor sales

Auditing data

Let’s audit the liquor sales data a little bit first just to make sure it is trustworthy.

Firstly some data prep:

#Make DF summarising reports for each store each month
gr_st_month_count <- ddply(data1, .(months(DATE), STORE), summarise, 
              count_reps=length(STORE))
setnames(gr_st_month_count,1, "month")
gr_st_month_count$month <- factor(gr_st_month_count$month,
                                  levels=month.name)

# Make separate DF counting number of participating stores per month
gr_st_month.part_stores <- data.frame("month"=factor(rownames(
  table(gr_st_month_count$month)),levels=month.name),
  "part_stores"=as.numeric(table(gr_st_month_count$month)))

#Combine data sets for ggplot use
gr_st_month_count <- left_join(gr_st_month_count, gr_st_month.part_stores)
rm(gr_st_month.part_stores)

And on to a boxplot which illustrates the monthly distribution of the number of transactions recorded in the database; I also incorporated information for the number of stores appearing in the data on a monthly basis as a colour variable.

#Number of reports per month and num. of participating stores overlay
ggplot(data=gr_st_month_count, aes(x=month, y=count_reps)) + 
  stat_boxplot(geom='errorbar') + geom_jitter(alpha=0.2) + 
  geom_boxplot(aes(fill=part_stores), outlier.size=0) +
  coord_cartesian(ylim = c(0, 1000)) +
  theme(panel.background = element_rect(fill = 'white'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        plot.title = element_text(size = rel(1), colour = "blue")) + 
  labs(x="Month", y = "Transactions", 
title = "Monthly Transactions between Iowa State and Liquor Stores")

We see a drastic reduction in the number of reports; a step change occurring in June. The subsequent plots presenting the same data grouped by week.

# Get weekly reports as per month methodology

gr_st_week_count <- ddply(data1, .(format(DATE, "%U"), STORE), summarise, 
                             count_reps=length(STORE))
setnames(gr_st_week_count,1, "week")
gr_st_week_count$week <- as.numeric(gr_st_week_count$week)
gr_st_week.part_stores <- data.frame("week"=rownames(table(gr_st_week_count$week)),
  "part_stores"=as.numeric(table(gr_st_week_count$week)))
gr_st_week.part_stores$week <- as.numeric(as.character(gr_st_week.part_stores$week))
gr_st_week_count <- left_join(gr_st_week_count, gr_st_week.part_stores)
#changed my mind. will make week factor
gr_st_week_count$week <- factor(gr_st_week_count$week)
rm(gr_st_week.part_stores)
#Plot: same as plot 1 but using boxplot colour as part. stores variable
ggplot(data=gr_st_week_count, aes(x=week, y=count_reps)) + 
  stat_boxplot(geom='errorbar') + geom_jitter(alpha=0.2) + 
  geom_boxplot(aes(fill=part_stores), outlier.size=0) +
  coord_cartesian(ylim = c(0, 250)) +
  theme(panel.background = element_rect(fill = 'white'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        plot.title = element_text(size = rel(1), colour = "blue")) + 
  labs(x="Week", y = "Transactions", 
       title = "Weekly Transactions between Iowa State and Liquor Stores")

Here we see that the step change occurs on the 1st week of June. Which is suspicious. Let’s zoom in on daily data for that period.

gr_st_day_count <- ddply(filter(data1, (DATE <= "2014-06-08") & 
                                  (DATE >="2014-05-25")), .(DATE, STORE),
                         summarise, count_reps=length(STORE))
setnames(gr_st_day_count,1, "day")
gr_st_day.part_stores <- data.frame("day"=rownames(table(gr_st_day_count$day)),
                                     "part_stores"=as.numeric(table(gr_st_day_count$day)))
gr_st_day.part_stores$day <- as.Date(as.character(gr_st_day.part_stores$day))
gr_st_day_count <- left_join(gr_st_day_count, gr_st_day.part_stores)
rm(gr_st_day.part_stores)
ggplot(data=gr_st_day_count, aes(x=as.character(day), y=count_reps)) + 
  stat_boxplot(geom='errorbar') + geom_jitter(alpha=0.2) + 
  geom_boxplot(aes(fill=part_stores), outlier.size=0) +
  coord_cartesian(ylim = c(0, 250)) +
  theme(panel.background = element_rect(fill = 'white'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        plot.title = element_text(size = rel(1), colour = "blue")) + 
  labs(x="Day", y = "Transactions", 
       title = "Daily Transactions between Iowa State and Liquor Stores")

So it’s actually the first day of June that the number of reports reduces dramatically. I performed a quick Google search on whether anything changed drastically in terms of Iowa liquor sales legislation, or if anything major happened but to no avail. Furthermore, I believe this is not a seasonal effect (end of Univ. semesters, basketball/football end of season) because the number of reports:

  • Falls too abruptly
  • Does not return to previous levels later in the year

Therefore I feel that this pattern has something to do with data collection/recording.

One thing I observed was that the fiscal year for the relevant Iowa state division runs from July to June. So this dataset partially contains data from two fiscal years (January to June: 2013, July to December: 2014). Could this mean something?

Anyway, I decided not to pursue this investigation any further; it is clear that the dataset has limitations and it is probably best to analyse these two periods independently (at least for time series analysis).

Before I move on to the main body of the analysis, let’s look at the distribution of transactions per weekday.

#Weekday
gr_st_weekday_count <- ddply(data1, .(weekdays(DATE), STORE), summarise, 
                             count_reps=length(STORE))
setnames(gr_st_weekday_count,1, "weekday")
day.name <- c("Monday", "Tuesday", "Wednesday", "Thursday", 
              "Friday", "Saturday", "Sunday")
gr_st_weekday_count$weekday <- factor(gr_st_weekday_count$weekday, levels=day.name)
gr_st_weekday.part_stores <- data.frame("weekday"=rownames(table(gr_st_weekday_count$weekday)),
                                    "part_stores"=as.numeric(table(gr_st_weekday_count$weekday)))
gr_st_weekday.part_stores$weekday <- factor(gr_st_weekday.part_stores$weekday, levels=day.name)
gr_st_weekday_count <- left_join(gr_st_weekday_count, gr_st_weekday.part_stores)
rm(gr_st_weekday.part_stores)
#Plot data
ggplot(data=gr_st_weekday_count, aes(x=weekday, y=count_reps)) + 
  stat_boxplot(geom='errorbar') + geom_jitter(alpha=0.2) + 
  geom_boxplot(aes(fill=part_stores), outlier.size=0) +
  coord_cartesian(ylim = c(0, 4500)) +
  theme(panel.background = element_rect(fill = 'white'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        plot.title = element_text(size = rel(1), colour = "blue")) + 
  labs(x="Weekday", y = "Transactions", 
       title = 
         "Number of orders submitted per weekday")

Here we see that most orders are placed on Wednesdays and Thursdays.

Analysis of liquor sales data

We saw that time series analyses on the full data-set are probably not suitable.

Let’s look at various dataset properties.

Plotting the distribution of bottles sold per transaction.

ggplot(data=data1, aes(x=`BOTTLE QTY`)) +
  geom_histogram(aes(y =..density..),
                 fill=I("blue"), col=I("black"), alpha=0.5, origin = -0.5) +
  labs(title="Histogram for Bottle Quantities", x="Bottles", y="Count") +
  theme(panel.background = element_rect(fill = 'white'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        axis.ticks=element_blank(),
        plot.title = element_text(size = rel(1), colour = "blue"))

Not easy to see shape of distribution. Looking at log-transform.

bks <- seq(min(data1$`BOTTLE QTY`, na.rm=T), max(data1$`BOTTLE QTY`, na.rm=T))
ggplot(data=data1, aes(x=`BOTTLE QTY`)) +
  geom_histogram(aes(y =..density..),
                 fill=I("blue"), col=I("black"), alpha=0.5, origin = -0.5) +
  labs(title="Histogram for Log Bottle Quantities", x="Bottles", y="Count") +
  scale_x_log10(breaks=bks, labels=NULL) +
  theme(panel.background = element_rect(fill = 'white'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        axis.ticks=element_blank(),
        plot.title = element_text(size = rel(1), colour = "blue"))

The log transform shows that the amount of bottles sold per transaction is to a large extent “quantised” (or non-uniform?). Let’s focus on the denser area of the distribution with un-transformed axes.

ggplot(data=data1, aes(x=`BOTTLE QTY`)) +
  geom_histogram(aes(y =..density..),
                 fill=I("blue"), col=I("black"), alpha=0.5, origin = -0.5) +
  labs(title="Histogram for Bottle Quantities", x="Bottles", y="Count") +
  theme(panel.background = element_rect(fill = 'white'),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        axis.ticks=element_blank(),
        plot.title = element_text(size = rel(1), colour = "blue")) + 
  scale_x_continuous(limits=c(0,50), breaks=seq(0,50,2))

So here we see that the mode of the distribution is 12; most orders are submitted for a dozen bottles of some liquor. We also see that multiples of 12 are also high occurring, as well as 6 (half-a-dozen).

Now I would like to look at total bottle sales as time-series, so let’s create appropriate variables for the period of January to May.

gr_date.Jan_May <- ddply(filter(data1, (DATE < "2014-06-01")),
                     .(DATE), summarise, BOTTLES = sum(`BOTTLE QTY`, na.rm=T), 
                     BOTTLES_NORM = sum(`BOTTLE QTY`/POPULATION, na.rm=T))

The above data frame contains three variables: date, total bottles sold and normalised bottles sold. The latter variable is an attempt to adjust for the population bias. This effect arises because we aggregate data from variably populated areas in Iowa and can potentially obscure meaningful relationships in the analysis of the data. The following code rearranges the data frame a bit to help out ggplot2.

test <- gr_date.Jan_May
maxbot <- max(gr_date.Jan_May$BOTTLES, na.rm=T)
maxbotnorm <- max(gr_date.Jan_May$BOTTLES_NORM, na.rm=T)
test<- arrange(gather(test, "type", "bottles", 2:3), DATE)
test <- mutate(test, maxbottles =ifelse(type=="BOTTLES", maxbot, maxbotnorm))
gr_date.Jan_May <- test
rm(test)

And the plot follows..

ggplot(data=gr_date.Jan_May,aes(x=DATE, y=bottles/maxbottles, colour=type)) + 
  geom_point()+ geom_line() +
                            geom_ribbon(aes(ymin=0, ymax=bottles/maxbottles, fill=type)) +
  scale_fill_brewer(palette="Set3")+
  labs(x="Date", y = "Bottles")+
  theme(panel.background = element_rect(fill = 'white'),
        plot.title = element_text(size = rel(1), colour = "blue")) 

When we look at the raw time series of the quantity of bottles ordered, we see some type of seasonality but it is not easy to interpret. However, once we attenuate the population effect, we see a very stable weekly pattern spiking on a single weekday. Let’s look at this data in more detail.

p1 <- ggplot(data=filter(gr_date.Jan_May, type=="BOTTLES_NORM"),
       aes(x=factor(weekdays(DATE), levels=day.names),
                             y=bottles)) +geom_boxplot()+
  labs(x="Weekday", y="Bottles (pop bias removed)")

p2 <- ggplot(data=filter(gr_date.Jan_May, type=="BOTTLES_NORM"), 
       aes(x=format(DATE, "%U"),
                             y=bottles)) +geom_boxplot()+
  labs(x="Week", y="Bottles (pop bias removed)")
  
p3 <- ggplot(data=filter(gr_date.Jan_May, type=="BOTTLES_NORM"),
       aes(x=factor(months(DATE), levels=month.name),
                             y=bottles)) +geom_boxplot()+
  labs(x="Month", y="Bottles (pop bias removed)")
multiplot(p1, p2, p3, cols=2)

rm(p1,p2,p3)

So in the weekday plot we see more clearly this stable pattern in quantities of bottles ordered per citizen. Tuesday is the “spike” day, Monday and Wednesday are pretty much the same; after Wednesday the quantities decline. In my mind this pattern makes sense: Monday is the first day of the week and the stores calculate their weekly requirements by the end of the day. They then submit the order either at end of business or early Tuesday; this allows time for the bottles to be delivered prior to the weekend. They shouldn’t need to order too many bottles after the early week’s budgeting.

It is a bit counter-intuitive though that in a previous plot we saw that the number of orders per weekday (as opposed to quantity of bottles ordered per weekday) was greatest on Wednesday’s and Thursday’s. Maybe if I had corrected for the population bias, a similar trend would have been revealed…

One last note on the previous weekday analysis. Let’s plot the raw and population-adjusted quantities against each other and see what we get. I hope it will show a bit more on how the normalisation helped the analysis.

ggplot(data=gr_date.Jan_May,aes(x=bottles[type=="BOTTLES"]/maxbottles[type=="BOTTLES"], 
                                y=bottles[type=="BOTTLES_NORM"]/maxbottles[type=="BOTTLES_NORM"],
                                shape=weekdays(DATE[type=="BOTTLES"]),
                                colour=weekdays(DATE[type=="BOTTLES"]))) +
  geom_point(size=3) + labs(x="raw quantities", y="population adjusted quantities")

I like this plot because it looks like a machine learning exercise.. We see here that each day forms an (almost) discrete cluster, with Tuesdays and Fridays having maximum separation from the main body of observations; however Fridays would belong to a high R-squared fitter regression line of all observations excluding Tuesdays. So I think the real outlier here is Tuesdays, in the sense that the normalisation added high information value for that weekday.

Bottle price vs total sales

In this small section I will group data per item (liquor) and plot the relationship between number of bottles ordered and bottle price.

We’ve already generated the relevant data frame gr_item.

Here is the raw plot…

#plot raw
ggplot(data=gr_item, aes(x=price, y=bottles)) + geom_point()

We don’t see too much. In these cases it is often helpful to log-transform axes.

#plot log-transformed
#good relationship
ggplot(data=gr_item, aes(x=price, y=bottles)) + geom_point(alpha=0.3) +
  scale_y_log10() + scale_x_log10()+ geom_smooth(fill="red") 

Let’s focus on the main body of observations. The function that adds the annotation to the following plot was taken from stackoverflow..

#Define lm_eqn function that adds model annotation
lm_eqn = function(m) {
  
  l <- list(a = format(coef(m)[1], digits = 2),
            b = format(abs(coef(m)[2]), digits = 2),
            r2 = format(summary(m)$r.squared, digits = 3));
  
  if (coef(m)[2] >= 0)  {
    eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,l)
  } else {
    eq <- substitute(italic(y) == a - b %.% italic(x)*","~~italic(r)^2~"="~r2,l)    
  }
  
  as.character(as.expression(eq));                 
}
#Plot
ggplot(data=gr_item, aes(x=price, y=bottles)) + geom_point(alpha=0.3) +
  scale_y_log10() + scale_x_log10(limits=c(5,250))+ geom_smooth(method="lm",fill="red") +
  geom_text(x = log10(150), y = log10(1e+05), 
            label = lm_eqn(lm(log10(bottles)~log10(price),gr_item)),
            parse = TRUE)

There is a a weak association (R-squared 0.13) between bottle price and quantities ordered. Therefore, from the plots above one cannot detect a simple function that links price and demand, however it is obvious that the relationship is inverse.

Relationships between order quantities of different liquors

In this small section I would like to look at order quantities as time series (note: using full dataset for this part). Let’s calculate our data frame first, where we group by month and liquor category, calculate total bottles ordered per category/month and derive from this a relevant normalised variable representing percentages over total bottles ordered.

gr_date_cat2 <- ddply(data1, .(DATE, `CATEGORY NAME`), summarise, 
                      BOTTLES = sum(`BOTTLE QTY`, na.rm=T))
gr_date_cat2.month <- ddply(gr_date_cat2, .(factor(months(DATE), levels=month.name),
                                    `CATEGORY NAME`), summarise,
                            BOTTLES = sum(`BOTTLES`, na.rm=T))
setnames(gr_date_cat2.month, 1, "Month")
gr_date_cat3 <- ddply(gr_date_cat2.month, .(Month), summarise,
                      TOT_BOTT = sum(BOTTLES, na.rm=T))

gr_date_cat2.month <- left_join(gr_date_cat2.month, gr_date_cat3)
gr_date_cat2.month$BCAT_NORM <- gr_date_cat2.month$BOTTLES/gr_date_cat2.month$TOT_BOTT
setnames(gr_date_cat2.month, 2, "CATEGORY")

# Some necessary manipulation for ggplot2 intricacies
dec_dates <- decimal_date(as.Date(c("2014-01-01", "2014-02-01", "2014-03-01",
                                    "2014-04-01", "2014-05-01", "2014-06-01",
                                    "2014-07-01", "2014-08-01", "2014-09-01",
                                    "2014-10-01", "2014-11-01", "2014-12-01")))

gr_date_cat2.month <- mutate(
  gr_date_cat2.month, month_nom = ifelse(Month=="January", dec_dates[1], ifelse(Month=="February", dec_dates[2], ifelse(Month=="March", dec_dates[3], ifelse(Month=="April", dec_dates[4], ifelse(Month=="May", dec_dates[5], ifelse(Month=="June", dec_dates[6], ifelse(Month=="July", dec_dates[7], ifelse(Month=="August", dec_dates[8], ifelse(Month=="September", dec_dates[9], ifelse(Month=="October", dec_dates[10], ifelse(Month=="November", dec_dates[11], ifelse(Month=="December", dec_dates[12], 0)))))))))))))
gr_date_cat2.month[gr_date_cat2.month$CATEGORY=="",]$CATEGORY <- "UNKNOWN"

In the subsequent plots I show time series of normalised bottle quantities of orders for each liquor category per month; I have added a colour overlay of season (winter, spring,…) in order to discern seasonal patterns more easily.

# Settings for ggplot2
txmin <- decimal_date(as.Date(c("2014-01-01", "2014-03-01", "2014-06-01","2014-09-01", "2014-12-01")))
txmax <- c(decimal_date(as.Date(c("2014-03-01", "2014-06-01", "2014-09-01","2014-12-01"))), Inf)
tymin <- rep(-Inf, 5)
tymax <- rep(Inf, 5)
season= c("1.winter", "2.spring","3.summer","4.autumn","1.winter")
rects <- data.frame(txmin,txmax,tymin,tymax,season)
rects[,5] <- sapply(rects[,5], as.character)
colScale <- c(col2hcl("blue"), col2hcl("green"), col2hcl("red"), col2hcl("purple"), col2hcl("blue"))


#Plotting
ggplot(data=gr_date_cat2.month) +
  geom_rect(data=rects, aes(xmin=txmin,xmax=txmax,ymin=tymin,ymax=tymax, 
                            fill=season), alpha=0.4)+
  scale_fill_manual(values=colScale)+
  geom_line(aes(x=month_nom, y=BCAT_NORM, group=1))+
  facet_wrap(~CATEGORY, scales="free_y")+
  labs(x="Month", y = "Normalised bottle sales", 
       title = "Liquor categories bottle sales with season overlay") + 
  theme(panel.background = element_rect(fill = 'white'),
        strip.text = element_text(size=7),
        strip.background = element_rect(fill="white"),
        axis.text.y = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks=element_blank(),
        plot.title = element_text(size = rel(1), colour = "blue"))

A few (qualitative) observations can be made. We can identify that certain liquor types have a demand curve that only peaks in December, for example liqueurs with various flavours like crème de menthe/cacao/etc., brandies and some types of schnapps (cinnamon, butterscotch). Given that these types of liquors have a heavier taste and are generally considered “warming”, their demand curve makes sense - we identified the “winter drinks”.

Another observation that makes sense, is the very similar curve shape of tequila and triple sec; two of the margarita’s ingredients (on the side I plotted respective bottle quantities to see if the 2:1 margarita ratio is reflected in bottle quantities - it didn’t). This demand pattern is there for a few of the liquors (flavoured rums, imported vodka), where bottle quantities are clearly correlated with expected average temperatures (demand lowest in winter, climbs gradually in spring, peaks between mid-spring/mid-summer, gradually descends in autumn). As opposed to the previous category, these are the “summer drinks”.

There is a lot of work that can still be performed here with good potential for hidden insights. Just by looking at the time series we can identify interesting patterns which don’t make immediate sense. For example:

  • Spiced rum peaks every three months at the beginning of each season
  • Whiskies and vodkas in general have highly heterogeneous demand patterns within their broader category
  • Barbados rum quantities peak only on one date - supply issue?

As a final insight in this section, let’s look at time series correlations between categories.

In order to pretty-print the correlation table, I will use the great corstarsl function taken from here.

#Took this function that makes nice correlation tables from here
#http://myowelt.blogspot.co.uk/2008/04/beautiful-correlation-tables-in-r.html
corstarsl <- function(x){ 
  require(Hmisc) 
  x <- as.matrix(x) 
  R <- rcorr(x)$r 
  p <- rcorr(x)$P 
  
  ## define notions for significance levels; spacing is important.
  mystars <- ifelse(p < .001, "***", ifelse(p < .01, "** ", ifelse(p < .05, "* ", " ")))
  
  ## trunctuate the matrix that holds the correlations to two decimal
  R <- format(round(cbind(rep(-1.11, ncol(x)), R), 2))[,-1] 
  
  ## build a new matrix that includes the correlations with their apropriate stars 
  Rnew <- matrix(paste(R, mystars, sep=""), ncol=ncol(x)) 
  diag(Rnew) <- paste(diag(R), " ", sep="") 
  rownames(Rnew) <- colnames(x) 
  colnames(Rnew) <- paste(colnames(x), "", sep="") 
  
  ## remove upper triangle
  Rnew <- as.matrix(Rnew)
  Rnew[upper.tri(Rnew, diag = TRUE)] <- ""
  Rnew <- as.data.frame(Rnew) 
  
  ## remove last column and return the matrix (which is now a data frame)
  Rnew <- cbind(Rnew[1:length(Rnew)-1])
  return(Rnew) 
}

And the table:

print.xtable(xtable(corstarsl(gr_date_cat.spread[,c(3:70)])), comment=F, type="html", size=1)
100 PROOF VODKA 80 PROOF VODKA AMERICAN ALCOHOL AMERICAN AMARETTO AMERICAN COCKTAILS AMERICAN DRY GINS AMERICAN GRAPE BRANDIES AMERICAN SLOE GINS ANISETTE APPLE SCHNAPPS APRICOT BRANDIES BARBADOS RUM BLACKBERRY BRANDIES BLENDED WHISKIES BOTTLED IN BOND BOURBON BUTTERSCOTCH SCHNAPPS CANADIAN WHISKIES CHERRY BRANDIES CINNAMON SCHNAPPS COFFEE LIQUEURS CREAM LIQUEURS CREME DE ALMOND DARK CREME DE CACAO DECANTERS & SPECIALTY PACKAGES DISTILLED SPIRITS SPECIALTY FLAVORED GIN FLAVORED RUM FLAVORED VODKA GRAPE SCHNAPPS GREEN CREME DE MENTHE HIGH PROOF BEER IMPORTED AMARETTO IMPORTED DRY GINS IMPORTED GRAPE BRANDIES IMPORTED SCHNAPPS IMPORTED VODKA IMPORTED VODKA - MISC IRISH WHISKIES JAMAICA RUM LOW PROOF VODKA MISC. AMERICAN CORDIALS & LIQUEURS MISC. IMPORTED CORDIALS & LIQUEURS MISCELLANEOUS BRANDIES MISCELLANEOUS SCHNAPPS OTHER PROOF VODKA PEACH BRANDIES PEACH SCHNAPPS PEPPERMINT SCHNAPPS PUERTO RICO & VIRGIN ISLANDS RUM RASPBERRY SCHNAPPS ROCK & RYE ROOT BEER SCHNAPPS SCOTCH WHISKIES SINGLE BARREL BOURBON WHISKIES SINGLE MALT SCOTCH SPEARMINT SCHNAPPS SPICED RUM STRAIGHT BOURBON WHISKIES STRAIGHT RYE WHISKIES STRAWBERRY SCHNAPPS TENNESSEE WHISKIES TEQUILA TRIPLE SEC TROPICAL FRUIT SCHNAPPS WATERMELON SCHNAPPS WHISKEY LIQUEUR WHITE CREME DE CACAO
100 PROOF VODKA
80 PROOF VODKA 0.65***
AMERICAN ALCOHOL 0.38*** 0.39***
AMERICAN AMARETTO 0.65*** 0.83*** 0.53***
AMERICAN COCKTAILS 0.49*** 0.76*** 0.20** 0.65***
AMERICAN DRY GINS 0.46*** 0.57*** 0.20** 0.40*** 0.37***
AMERICAN GRAPE BRANDIES 0.40*** 0.57*** 0.14 0.47*** 0.34*** 0.78***
AMERICAN SLOE GINS 0.33*** 0.50*** 0.26*** 0.50*** 0.50*** 0.19** 0.25***
ANISETTE 0.35*** 0.44*** 0.21** 0.38*** 0.29*** 0.31*** 0.27*** 0.16*
APPLE SCHNAPPS 0.54*** 0.73*** 0.38*** 0.71*** 0.58*** 0.47*** 0.55*** 0.40*** 0.36***
APRICOT BRANDIES 0.42*** 0.59*** 0.43*** 0.65*** 0.47*** 0.14* 0.14* 0.49*** 0.19* 0.39***
BARBADOS RUM 0.01 0.06 -0.05 0.00 0.01 0.01 0.04 -0.03 0.02 0.00 -0.05
BLACKBERRY BRANDIES 0.44*** 0.61*** 0.29*** 0.64*** 0.48*** 0.13 0.14 0.48*** 0.23** 0.39*** 0.81*** -0.06
BLENDED WHISKIES 0.64*** 0.89*** 0.46*** 0.85*** 0.68*** 0.54*** 0.59*** 0.57*** 0.43*** 0.81*** 0.54*** 0.04 0.56***
BOTTLED IN BOND BOURBON 0.26*** 0.24*** 0.16* 0.28*** 0.14 0.36*** 0.39*** 0.01 0.21* 0.20** 0.15* 0.01 0.11 0.30***
BUTTERSCOTCH SCHNAPPS 0.47*** 0.67*** 0.49*** 0.67*** 0.41*** 0.30*** 0.33*** 0.33*** 0.35*** 0.71*** 0.52*** -0.04 0.51*** 0.68*** 0.08
CANADIAN WHISKIES 0.51*** 0.79*** 0.41*** 0.71*** 0.66*** 0.33*** 0.35*** 0.50*** 0.29*** 0.51*** 0.72*** 0.03 0.70*** 0.69*** 0.19** 0.53***
CHERRY BRANDIES 0.33*** 0.54*** 0.27*** 0.50*** 0.45*** 0.12 0.14 0.45*** 0.18* 0.38*** 0.58*** -0.02 0.60*** 0.49*** 0.14 0.48*** 0.62***
CINNAMON SCHNAPPS 0.54*** 0.71*** 0.52*** 0.69*** 0.45*** 0.24*** 0.28*** 0.43*** 0.34*** 0.61*** 0.64*** -0.03 0.60*** 0.66*** 0.09 0.82*** 0.61*** 0.54***
COFFEE LIQUEURS 0.58*** 0.74*** 0.58*** 0.75*** 0.49*** 0.36*** 0.42*** 0.48*** 0.29*** 0.62*** 0.61*** 0.02 0.60*** 0.73*** 0.21** 0.68*** 0.70*** 0.50*** 0.71***
CREAM LIQUEURS 0.41*** 0.60*** 0.53*** 0.64*** 0.32*** 0.24*** 0.33*** 0.33*** 0.30*** 0.49*** 0.53*** 0.04 0.53*** 0.63*** 0.19** 0.64*** 0.59*** 0.40*** 0.62*** 0.72***
CREME DE ALMOND 0.34** 0.08 0.21 0.39*** -0.02 -0.05 0.00 0.06 -0.02 0.14 0.26* -0.07 0.45*** 0.14 -0.18 0.15 0.08 -0.01 0.30* 0.34** 0.31*
DARK CREME DE CACAO 0.28*** 0.55*** 0.37*** 0.59*** 0.39*** 0.21** 0.28*** 0.37*** 0.28*** 0.55*** 0.45*** -0.06 0.40*** 0.55*** 0.08 0.55*** 0.51*** 0.41*** 0.55*** 0.59*** 0.53*** 0.25*
DECANTERS & SPECIALTY PACKAGES -0.10 -0.16* 0.08 -0.12 -0.20** -0.18* -0.17* -0.13 0.07 -0.16* -0.14 -0.04 -0.15* -0.10 -0.07 -0.06 -0.07 -0.11 -0.09 -0.07 0.11 -0.19 -0.08
DISTILLED SPIRITS SPECIALTY 0.41*** 0.59*** 0.24*** 0.48*** 0.43*** 0.44*** 0.40*** 0.24*** 0.29*** 0.49*** 0.25*** -0.06 0.27*** 0.54*** 0.14 0.37*** 0.47*** 0.29*** 0.38*** 0.50*** 0.31*** 0.01 0.34*** -0.28***
FLAVORED GIN 0.54*** 0.68*** 0.31*** 0.65*** 0.51*** 0.50*** 0.57*** 0.41*** 0.32*** 0.60*** 0.38*** 0.04 0.36*** 0.72*** 0.23** 0.48*** 0.50*** 0.28*** 0.49*** 0.57*** 0.44*** 0.11 0.32*** -0.10 0.42***
FLAVORED RUM 0.53*** 0.81*** 0.24*** 0.76*** 0.81*** 0.52*** 0.57*** 0.46*** 0.32*** 0.66*** 0.45*** 0.06 0.45*** 0.79*** 0.24** 0.47*** 0.68*** 0.43*** 0.50*** 0.57*** 0.42*** 0.01 0.46*** -0.08 0.50*** 0.63***
FLAVORED VODKA 0.52*** 0.76*** 0.25*** 0.64*** 0.63*** 0.61*** 0.70*** 0.31*** 0.30*** 0.62*** 0.24*** 0.09 0.30*** 0.70*** 0.24** 0.44*** 0.48*** 0.21** 0.43*** 0.55*** 0.41*** 0.10 0.36*** -0.11 0.43*** 0.64*** 0.72***
GRAPE SCHNAPPS 0.35*** 0.47*** 0.04 0.38*** 0.34*** 0.60*** 0.72*** 0.16* 0.23** 0.62*** -0.05 0.01 -0.03 0.51*** 0.23** 0.35*** 0.13 0.03 0.23** 0.33*** 0.23** -0.03 0.28*** -0.18* 0.44*** 0.44*** 0.44*** 0.58***
GREEN CREME DE MENTHE 0.41*** 0.42*** 0.29*** 0.58*** 0.26*** 0.22** 0.28*** 0.34*** 0.12 0.43*** 0.46*** -0.05 0.62*** 0.51*** 0.15 0.42*** 0.33*** 0.22** 0.47*** 0.45*** 0.45*** 0.74*** 0.39*** -0.17* 0.16* 0.33*** 0.32*** 0.30*** 0.24***
HIGH PROOF BEER -0.29 -0.65 NaNNA -0.67 0.18 0.12 0.00 -0.55 -0.50 -0.35 -0.75 0.37 -0.63 -0.52 0.31 -0.99** -0.40 -0.83 -0.97* -0.94 -0.99** NaNNA -0.95* -0.15 0.38 -0.23 -0.39 -0.20 0.30 NaNNA
IMPORTED AMARETTO 0.45*** 0.62*** 0.43*** 0.64*** 0.43*** 0.24*** 0.34*** 0.35*** 0.16* 0.59*** 0.43*** 0.05 0.46*** 0.67*** 0.16* 0.58*** 0.51*** 0.37*** 0.56*** 0.62*** 0.55*** 0.27* 0.45*** -0.14 0.33*** 0.46*** 0.52*** 0.45*** 0.30*** 0.44*** -0.81
IMPORTED DRY GINS 0.39*** 0.69*** 0.17* 0.56*** 0.63*** 0.42*** 0.49*** 0.34*** 0.33*** 0.59*** 0.19** 0.06 0.18* 0.71*** 0.12 0.44*** 0.48*** 0.22** 0.41*** 0.44*** 0.40*** -0.14 0.41*** -0.05 0.45*** 0.55*** 0.74*** 0.62*** 0.54*** 0.16* -0.85 0.45***
IMPORTED GRAPE BRANDIES 0.38*** 0.51*** 0.12 0.40*** 0.38*** 0.70*** 0.91*** 0.30*** 0.16 0.48*** 0.12 0.04 0.13 0.54*** 0.37*** 0.23*** 0.35*** 0.11 0.19** 0.37*** 0.26*** -0.09 0.18* -0.13 0.35*** 0.57*** 0.56*** 0.65*** 0.65*** 0.21** 0.26 0.28*** 0.50***
IMPORTED SCHNAPPS 0.59*** 0.86*** 0.42*** 0.83*** 0.65*** 0.45*** 0.51*** 0.51*** 0.38*** 0.79*** 0.51*** 0.05 0.53*** 0.92*** 0.17* 0.70*** 0.65*** 0.46*** 0.67*** 0.74*** 0.63*** 0.06 0.56*** -0.14 0.55*** 0.69*** 0.74*** 0.68*** 0.52*** 0.47*** -0.80 0.69*** 0.70*** 0.48***
IMPORTED VODKA 0.52*** 0.82*** 0.32*** 0.69*** 0.63*** 0.56*** 0.60*** 0.40*** 0.43*** 0.67*** 0.29*** 0.07 0.33*** 0.82*** 0.22** 0.49*** 0.63*** 0.34*** 0.48*** 0.61*** 0.48*** -0.06 0.48*** -0.01 0.58*** 0.65*** 0.81*** 0.72*** 0.57*** 0.30*** -0.64 0.54*** 0.81*** 0.59*** 0.81***
IMPORTED VODKA - MISC 0.43*** 0.66*** 0.14 0.56*** 0.56*** 0.58*** 0.71*** 0.23*** 0.34*** 0.62*** 0.07 0.05 0.12 0.68*** 0.18* 0.37*** 0.39*** 0.15* 0.30*** 0.44*** 0.32*** -0.05 0.35*** -0.02 0.51*** 0.61*** 0.69*** 0.74*** 0.71*** 0.22** -0.04 0.39*** 0.77*** 0.72*** 0.69*** 0.81***
IRISH WHISKIES 0.36*** 0.50*** 0.28*** 0.45*** 0.23*** 0.58*** 0.48*** 0.21** 0.32*** 0.45*** 0.25*** -0.01 0.33*** 0.57*** 0.20** 0.38*** 0.33*** 0.23** 0.31*** 0.49*** 0.48*** 0.06 0.34*** -0.09 0.44*** 0.45*** 0.42*** 0.40*** 0.45*** 0.41*** -0.63 0.38*** 0.37*** 0.42*** 0.57*** 0.53*** 0.42***
JAMAICA RUM 0.47*** 0.60*** 0.30*** 0.59*** 0.42*** 0.43*** 0.50*** 0.31*** 0.35*** 0.55*** 0.32*** 0.10 0.29*** 0.64*** 0.29*** 0.50*** 0.40*** 0.24*** 0.42*** 0.55*** 0.38*** 0.00 0.33*** -0.06 0.41*** 0.57*** 0.57*** 0.52*** 0.48*** 0.27*** -0.84 0.51*** 0.56*** 0.49*** 0.65*** 0.60*** 0.55*** 0.36***
LOW PROOF VODKA 0.35*** 0.49*** 0.10 0.50*** 0.45*** 0.21** 0.17* 0.36*** 0.24** 0.28*** 0.51*** 0.01 0.57*** 0.45*** 0.25*** 0.28*** 0.56*** 0.42*** 0.34*** 0.39*** 0.29*** 0.16 0.34*** -0.10 0.15* 0.35*** 0.48*** 0.26*** 0.08 0.34*** -0.29 0.38*** 0.31*** 0.20** 0.40*** 0.35*** 0.19** 0.28*** 0.35***
MISC. AMERICAN CORDIALS & LIQUEURS 0.39*** 0.65*** 0.39*** 0.47*** 0.37*** 0.47*** 0.37*** 0.35*** 0.31*** 0.51*** 0.39*** 0.01 0.40*** 0.59*** 0.24** 0.52*** 0.57*** 0.48*** 0.55*** 0.54*** 0.51*** -0.04 0.38*** 0.06 0.41*** 0.39*** 0.44*** 0.42*** 0.30*** 0.20** -0.91 0.40*** 0.33*** 0.32*** 0.56*** 0.51*** 0.34*** 0.41*** 0.35*** 0.27***
MISC. IMPORTED CORDIALS & LIQUEURS 0.53*** 0.82*** 0.30*** 0.73*** 0.63*** 0.52*** 0.60*** 0.36*** 0.47*** 0.76*** 0.35*** 0.05 0.41*** 0.84*** 0.22** 0.61*** 0.65*** 0.40*** 0.55*** 0.66*** 0.56*** -0.03 0.59*** -0.11 0.60*** 0.62*** 0.75*** 0.68*** 0.62*** 0.36*** -0.89 0.58*** 0.80*** 0.54*** 0.85*** 0.87*** 0.78*** 0.54*** 0.65*** 0.40*** 0.49***
MISCELLANEOUS BRANDIES 0.10 0.17* 0.15 0.08 0.05 0.12 0.11 0.14 0.06 0.12 0.14 0.09 0.20** 0.16* -0.01 0.24** 0.09 0.13 0.21** 0.23** 0.15* -0.05 0.11 0.19* 0.15* 0.02 0.09 0.06 0.13 0.15* 0.21 0.14 -0.02 0.11 0.15* 0.06 0.05 0.14 0.13 0.03 0.34*** 0.09
MISCELLANEOUS SCHNAPPS 0.42*** 0.60*** 0.29*** 0.54*** 0.38*** 0.31*** 0.38*** 0.32*** 0.25** 0.55*** 0.33*** 0.05 0.34*** 0.67*** 0.06 0.52*** 0.42*** 0.29*** 0.50*** 0.52*** 0.50*** 0.10 0.40*** -0.13 0.46*** 0.55*** 0.50*** 0.51*** 0.37*** 0.29*** -0.85 0.51*** 0.53*** 0.35*** 0.66*** 0.58*** 0.51*** 0.39*** 0.48*** 0.23** 0.44*** 0.58*** 0.13
OTHER PROOF VODKA 0.39*** 0.42*** 0.40*** 0.45*** 0.26*** 0.35*** 0.26*** 0.24** 0.06 0.42*** 0.28*** 0.02 0.38*** 0.46*** 0.15 0.38*** 0.32*** 0.29*** 0.35*** 0.40*** 0.26*** 0.06 0.12 -0.14 0.14 0.47*** 0.25** 0.35*** 0.22** 0.50*** NaNNA 0.29*** 0.16* 0.36*** 0.50*** 0.29*** 0.25** 0.38*** 0.31*** 0.19* 0.35*** 0.29*** 0.12 0.34***
PEACH BRANDIES -0.28*** -0.34*** -0.09 -0.38*** -0.36*** -0.20** -0.11 -0.20** -0.13 -0.31*** -0.24*** 0.04 -0.24*** -0.36*** -0.17* -0.27*** -0.32*** -0.25*** -0.26*** -0.29*** -0.21** 0.01 -0.18* 0.08 -0.25*** -0.28*** -0.34*** -0.26*** -0.28*** -0.16* -0.35 -0.27*** -0.32*** -0.17* -0.39*** -0.32*** -0.26*** -0.25*** -0.30*** -0.32*** -0.23** -0.35*** -0.02 -0.24*** -0.15
PEACH SCHNAPPS 0.57*** 0.77*** 0.29*** 0.70*** 0.67*** 0.47*** 0.51*** 0.49*** 0.33*** 0.83*** 0.42*** 0.01 0.43*** 0.79*** 0.14 0.74*** 0.56*** 0.46*** 0.68*** 0.63*** 0.46*** 0.09 0.52*** -0.16* 0.50*** 0.61*** 0.72*** 0.64*** 0.61*** 0.39*** -0.75 0.59*** 0.65*** 0.48*** 0.78*** 0.68*** 0.61*** 0.44*** 0.60*** 0.36*** 0.53*** 0.75*** 0.20** 0.55*** 0.37*** -0.35***
PEPPERMINT SCHNAPPS 0.57*** 0.77*** 0.49*** 0.82*** 0.65*** 0.22** 0.26*** 0.52*** 0.34*** 0.61*** 0.77*** -0.02 0.74*** 0.75*** 0.15* 0.71*** 0.76*** 0.59*** 0.76*** 0.75*** 0.69*** 0.26* 0.59*** -0.07 0.38*** 0.51*** 0.61*** 0.49*** 0.15* 0.47*** -0.56 0.59*** 0.42*** 0.21** 0.73*** 0.52*** 0.35*** 0.31*** 0.46*** 0.49*** 0.46*** 0.61*** 0.15* 0.52*** 0.28*** -0.31*** 0.60***
PUERTO RICO & VIRGIN ISLANDS RUM 0.65*** 0.91*** 0.40*** 0.82*** 0.73*** 0.56*** 0.58*** 0.52*** 0.35*** 0.71*** 0.53*** 0.02 0.53*** 0.88*** 0.23** 0.60*** 0.81*** 0.48*** 0.63*** 0.72*** 0.58*** 0.15 0.55*** -0.13 0.58*** 0.69*** 0.83*** 0.74*** 0.48*** 0.42*** -0.73 0.61*** 0.76*** 0.55*** 0.86*** 0.86*** 0.70*** 0.49*** 0.60*** 0.50*** 0.55*** 0.88*** 0.08 0.61*** 0.36*** -0.38*** 0.74*** 0.72***
RASPBERRY SCHNAPPS 0.44*** 0.63*** 0.32*** 0.64*** 0.45*** 0.38*** 0.42*** 0.44*** 0.21* 0.58*** 0.37*** 0.05 0.35*** 0.66*** 0.12 0.51*** 0.45*** 0.34*** 0.48*** 0.48*** 0.44*** 0.13 0.40*** -0.15* 0.40*** 0.57*** 0.56*** 0.48*** 0.42*** 0.40*** -0.80 0.41*** 0.56*** 0.44*** 0.66*** 0.61*** 0.54*** 0.39*** 0.46*** 0.29*** 0.36*** 0.58*** 0.02 0.54*** 0.38*** -0.32*** 0.60*** 0.50*** 0.62***
ROCK & RYE 0.32*** 0.48*** 0.25** 0.45*** 0.35*** 0.26*** 0.22** 0.17* 0.15 0.34*** 0.41*** -0.05 0.45*** 0.41*** 0.23** 0.39*** 0.45*** 0.44*** 0.37*** 0.43*** 0.29*** -0.01 0.34*** -0.08 0.23** 0.29*** 0.34*** 0.26*** 0.17* 0.27*** -0.05 0.26*** 0.25*** 0.21** 0.39*** 0.34*** 0.22** 0.24** 0.33*** 0.44*** 0.41*** 0.42*** 0.13 0.27*** 0.27*** -0.25*** 0.34*** 0.46*** 0.42*** 0.19*
ROOT BEER SCHNAPPS 0.38*** 0.62*** 0.28*** 0.60*** 0.59*** 0.25*** 0.23** 0.37*** 0.22** 0.54*** 0.44*** 0.01 0.45*** 0.62*** 0.06 0.48*** 0.51*** 0.41*** 0.54*** 0.54*** 0.42*** 0.07 0.41*** -0.15* 0.35*** 0.43*** 0.54*** 0.46*** 0.32*** 0.30*** 0.36 0.42*** 0.48*** 0.20** 0.65*** 0.49*** 0.38*** 0.40*** 0.43*** 0.38*** 0.38*** 0.57*** -0.02 0.37*** 0.22** -0.34*** 0.60*** 0.58*** 0.60*** 0.47*** 0.34***
SCOTCH WHISKIES 0.54*** 0.83*** 0.37*** 0.75*** 0.70*** 0.37*** 0.40*** 0.48*** 0.39*** 0.62*** 0.57*** -0.01 0.60*** 0.79*** 0.17* 0.61*** 0.75*** 0.53*** 0.61*** 0.68*** 0.59*** 0.07 0.50*** -0.05 0.47*** 0.60*** 0.73*** 0.60*** 0.36*** 0.37*** -0.74 0.63*** 0.72*** 0.39*** 0.76*** 0.71*** 0.55*** 0.43*** 0.58*** 0.58*** 0.48*** 0.80*** 0.17* 0.48*** 0.26*** -0.39*** 0.68*** 0.74*** 0.85*** 0.53*** 0.37*** 0.59***
SINGLE BARREL BOURBON WHISKIES 0.12 0.21** 0.17* 0.17* 0.19** 0.15* 0.26*** 0.08 0.09 0.28*** 0.01 -0.02 0.05 0.27*** 0.19* 0.21** 0.20** 0.04 0.14* 0.12 0.12 -0.04 0.15* 0.07 0.22** 0.23** 0.20** 0.19** 0.25*** 0.15* 0.75 0.15* 0.32*** 0.34*** 0.26*** 0.32*** 0.36*** 0.08 0.21** -0.03 0.07 0.34*** 0.15* 0.16* 0.15 -0.10 0.24*** 0.12 0.25*** 0.19** 0.09 0.11 0.24***
SINGLE MALT SCOTCH 0.50*** 0.70*** 0.34*** 0.60*** 0.48*** 0.46*** 0.50*** 0.33*** 0.39*** 0.67*** 0.32*** 0.03 0.33*** 0.78*** 0.14 0.58*** 0.50*** 0.32*** 0.49*** 0.57*** 0.54*** -0.11 0.53*** -0.02 0.48*** 0.53*** 0.59*** 0.58*** 0.49*** 0.30*** -0.64 0.56*** 0.67*** 0.46*** 0.75*** 0.74*** 0.65*** 0.51*** 0.59*** 0.30*** 0.50*** 0.77*** 0.18* 0.59*** 0.28*** -0.27*** 0.64*** 0.52*** 0.71*** 0.50*** 0.42*** 0.51*** 0.65*** 0.25***
SPEARMINT SCHNAPPS 0.15* 0.35*** 0.04 0.29*** 0.18* 0.32*** 0.38*** 0.09 0.23** 0.35*** 0.21** -0.01 0.24** 0.32*** 0.16* 0.35*** 0.26*** 0.23** 0.27*** 0.26*** 0.29*** -0.19 0.26*** -0.18* 0.23** 0.33*** 0.28*** 0.29*** 0.38*** 0.23** -0.86 0.15* 0.26*** 0.33*** 0.35*** 0.28*** 0.32*** 0.38*** 0.30*** 0.24** 0.18* 0.41*** 0.02 0.18* 0.35*** -0.11 0.36*** 0.24** 0.30*** 0.29*** 0.17* 0.30*** 0.37*** 0.11 0.28***
SPICED RUM 0.64*** 0.92*** 0.44*** 0.83*** 0.73*** 0.46*** 0.52*** 0.55*** 0.38*** 0.70*** 0.66*** 0.05 0.61*** 0.87*** 0.21** 0.69*** 0.79*** 0.57*** 0.75*** 0.73*** 0.69*** 0.06 0.55*** -0.12 0.53*** 0.64*** 0.77*** 0.69*** 0.37*** 0.43*** -0.82 0.63*** 0.66*** 0.46*** 0.86*** 0.75*** 0.59*** 0.48*** 0.55*** 0.46*** 0.60*** 0.75*** 0.11 0.57*** 0.38*** -0.34*** 0.71*** 0.83*** 0.86*** 0.61*** 0.38*** 0.60*** 0.81*** 0.19** 0.67*** 0.31***
STRAIGHT BOURBON WHISKIES 0.61*** 0.90*** 0.45*** 0.80*** 0.64*** 0.46*** 0.51*** 0.47*** 0.48*** 0.73*** 0.58*** 0.02 0.63*** 0.88*** 0.28*** 0.70*** 0.76*** 0.61*** 0.73*** 0.73*** 0.64*** 0.14 0.60*** -0.08 0.57*** 0.60*** 0.71*** 0.62*** 0.45*** 0.47*** -0.60 0.67*** 0.68*** 0.41*** 0.85*** 0.79*** 0.60*** 0.51*** 0.58*** 0.48*** 0.67*** 0.84*** 0.23** 0.59*** 0.35*** -0.36*** 0.75*** 0.76*** 0.86*** 0.58*** 0.50*** 0.62*** 0.83*** 0.25*** 0.73*** 0.29*** 0.86***
STRAIGHT RYE WHISKIES 0.41*** 0.62*** 0.33*** 0.56*** 0.55*** 0.35*** 0.39*** 0.25*** 0.33*** 0.63*** 0.29*** -0.05 0.29*** 0.62*** 0.11 0.49*** 0.37*** 0.25*** 0.43*** 0.47*** 0.37*** 0.00 0.45*** -0.17* 0.41*** 0.44*** 0.52*** 0.57*** 0.42*** 0.32*** 0.08 0.50*** 0.48*** 0.33*** 0.61*** 0.51*** 0.51*** 0.38*** 0.44*** 0.14 0.27*** 0.61*** 0.11 0.42*** 0.28*** -0.20** 0.57*** 0.56*** 0.55*** 0.43*** 0.24*** 0.39*** 0.56*** 0.21** 0.54*** 0.18* 0.58*** 0.58***
STRAWBERRY SCHNAPPS 0.41*** 0.62*** 0.26*** 0.58*** 0.42*** 0.47*** 0.49*** 0.23** 0.28*** 0.58*** 0.22** 0.03 0.24*** 0.62*** 0.15* 0.43*** 0.39*** 0.28*** 0.38*** 0.52*** 0.38*** -0.08 0.42*** -0.11 0.51*** 0.45*** 0.55*** 0.58*** 0.54*** 0.23** -0.86 0.46*** 0.57*** 0.44*** 0.65*** 0.63*** 0.60*** 0.51*** 0.53*** 0.27*** 0.39*** 0.68*** 0.09 0.46*** 0.32*** -0.31*** 0.57*** 0.39*** 0.62*** 0.47*** 0.24** 0.43*** 0.58*** 0.24*** 0.57*** 0.42*** 0.54*** 0.57*** 0.44***
TENNESSEE WHISKIES 0.40*** 0.75*** 0.32*** 0.61*** 0.69*** 0.36*** 0.38*** 0.42*** 0.37*** 0.57*** 0.41*** 0.08 0.43*** 0.68*** 0.09 0.44*** 0.57*** 0.43*** 0.47*** 0.55*** 0.54*** 0.01 0.46*** -0.15* 0.38*** 0.44*** 0.63*** 0.59*** 0.37*** 0.31*** -0.02 0.48*** 0.53*** 0.33*** 0.67*** 0.59*** 0.48*** 0.40*** 0.43*** 0.31*** 0.48*** 0.60*** 0.12 0.45*** 0.27*** -0.28*** 0.60*** 0.63*** 0.66*** 0.34*** 0.37*** 0.54*** 0.66*** 0.11 0.56*** 0.29*** 0.75*** 0.66*** 0.53*** 0.49***
TEQUILA 0.54*** 0.80*** 0.22** 0.66*** 0.70*** 0.59*** 0.66*** 0.39*** 0.42*** 0.72*** 0.24*** 0.08 0.26*** 0.80*** 0.18* 0.49*** 0.56*** 0.29*** 0.43*** 0.55*** 0.42*** -0.07 0.47*** -0.12 0.52*** 0.65*** 0.75*** 0.75*** 0.68*** 0.28*** -0.37 0.51*** 0.80*** 0.65*** 0.79*** 0.84*** 0.84*** 0.46*** 0.59*** 0.31*** 0.43*** 0.87*** 0.04 0.52*** 0.36*** -0.28*** 0.73*** 0.51*** 0.83*** 0.60*** 0.32*** 0.53*** 0.73*** 0.34*** 0.71*** 0.35*** 0.72*** 0.72*** 0.57*** 0.69*** 0.61***
TRIPLE SEC 0.42*** 0.61*** 0.10 0.49*** 0.46*** 0.57*** 0.60*** 0.17* 0.39*** 0.59*** -0.01 0.01 0.05 0.62*** 0.17* 0.38*** 0.30*** 0.07 0.26*** 0.38*** 0.25*** -0.03 0.32*** -0.12 0.53*** 0.53*** 0.59*** 0.60*** 0.73*** 0.20** -0.63 0.37*** 0.76*** 0.55*** 0.65*** 0.74*** 0.82*** 0.47*** 0.55*** 0.17* 0.28*** 0.80*** -0.01 0.43*** 0.22** -0.31*** 0.58*** 0.27*** 0.67*** 0.49*** 0.20** 0.39*** 0.57*** 0.30*** 0.60*** 0.36*** 0.49*** 0.58*** 0.45*** 0.65*** 0.42*** 0.86***
TROPICAL FRUIT SCHNAPPS 0.44*** 0.55*** 0.18* 0.59*** 0.63*** 0.25*** 0.27*** 0.51*** 0.15 0.44*** 0.45*** 0.04 0.45*** 0.57*** 0.10 0.36*** 0.49*** 0.37*** 0.38*** 0.35*** 0.30*** 0.01 0.27*** -0.16* 0.31*** 0.54*** 0.66*** 0.45*** 0.19** 0.28*** -0.32 0.36*** 0.46*** 0.29*** 0.54*** 0.42*** 0.33*** 0.24*** 0.40*** 0.36*** 0.25*** 0.45*** 0.05 0.38*** 0.29*** -0.29*** 0.53*** 0.51*** 0.59*** 0.49*** 0.18* 0.46*** 0.54*** 0.04 0.34*** 0.22** 0.58*** 0.50*** 0.37*** 0.36*** 0.40*** 0.44*** 0.28***
WATERMELON SCHNAPPS 0.42*** 0.59*** 0.17* 0.53*** 0.52*** 0.50*** 0.59*** 0.29*** 0.24** 0.76*** 0.19** 0.02 0.18** 0.68*** 0.18* 0.54*** 0.30*** 0.20** 0.41*** 0.44*** 0.35*** -0.04 0.36*** -0.18* 0.46*** 0.55*** 0.59*** 0.62*** 0.77*** 0.31*** -0.71 0.44*** 0.62*** 0.55*** 0.71*** 0.62*** 0.68*** 0.47*** 0.50*** 0.19** 0.36*** 0.67*** 0.12 0.52*** 0.36*** -0.34*** 0.77*** 0.37*** 0.61*** 0.58*** 0.18* 0.46*** 0.49*** 0.24*** 0.55*** 0.37*** 0.55*** 0.57*** 0.50*** 0.60*** 0.48*** 0.72*** 0.69*** 0.43***
WHISKEY LIQUEUR 0.54*** 0.88*** 0.36*** 0.72*** 0.64*** 0.51*** 0.56*** 0.44*** 0.47*** 0.69*** 0.46*** 0.13 0.51*** 0.83*** 0.17* 0.65*** 0.69*** 0.48*** 0.61*** 0.70*** 0.64*** 0.05 0.51*** -0.09 0.56*** 0.59*** 0.70*** 0.68*** 0.52*** 0.42*** -0.51 0.59*** 0.62*** 0.47*** 0.84*** 0.78*** 0.65*** 0.55*** 0.61*** 0.35*** 0.66*** 0.81*** 0.23** 0.57*** 0.43*** -0.31*** 0.73*** 0.66*** 0.80*** 0.55*** 0.39*** 0.61*** 0.72*** 0.22** 0.71*** 0.36*** 0.82*** 0.83*** 0.58*** 0.60*** 0.75*** 0.77*** 0.61*** 0.44*** 0.61***
WHITE CREME DE CACAO 0.30*** 0.50*** 0.14 0.32*** 0.22** 0.27*** 0.31*** 0.23** 0.27*** 0.37*** 0.29*** -0.01 0.30*** 0.44*** 0.27*** 0.43*** 0.35*** 0.53*** 0.46*** 0.37*** 0.27*** -0.03 0.39*** -0.14* 0.31*** 0.26*** 0.23** 0.25*** 0.28*** 0.30*** 0.70 0.32*** 0.21** 0.25*** 0.43*** 0.32*** 0.27*** 0.27*** 0.31*** 0.20** 0.53*** 0.42*** 0.27*** 0.28*** 0.36*** -0.14 0.43*** 0.32*** 0.35*** 0.31*** 0.43*** 0.31*** 0.30*** 0.16* 0.40*** 0.15* 0.39*** 0.60*** 0.33*** 0.25*** 0.26*** 0.33*** 0.21** 0.13 0.28*** 0.55***
WHITE CREME DE MENTHE 0.29*** 0.36*** 0.15* 0.39*** 0.25*** 0.26*** 0.32*** 0.27*** 0.08 0.31*** 0.29*** 0.08 0.31*** 0.37*** 0.35*** 0.29*** 0.27*** 0.19* 0.33*** 0.35*** 0.25*** 0.14 0.29*** -0.10 0.15 0.18* 0.29*** 0.39*** 0.20** 0.41*** NaNNA 0.24** 0.15* 0.30*** 0.33*** 0.27*** 0.22** 0.22** 0.33*** 0.25*** 0.26*** 0.26*** 0.10 0.23** 0.15 -0.14 0.32*** 0.30*** 0.34*** 0.28*** 0.19* 0.27*** 0.20** 0.13 0.29*** 0.09 0.35*** 0.36*** 0.17* 0.26*** 0.18* 0.24** 0.08 0.18* 0.19* 0.36*** 0.41***
corr.cat.ts <- rcorr(as.matrix(gr_date_cat.spread[,c(3:70)]))$r
corr.cat.ts <- data.frame(corr.cat.ts)
corr.cat.ts.dec <- apply(corr.cat.ts, 2, sort, decreasing=T)
corr.cat.ts.inc <- apply(corr.cat.ts,2, sort)
corr.cat.ts.min <- lapply(corr.cat.ts.inc, function(x) x[1:3])
corr.cat.ts.max <- lapply(corr.cat.ts.dec, function(x) x[2:4])

In the following lists we display the highest correlations (positive and negative) per category.

#Maximum positive correlations per category
sapply(corr.cat.ts.max, function(x) names(x))
##      X100.PROOF.VODKA                   X80.PROOF.VODKA                   
## [1,] "80 PROOF VODKA"                   "SPICED RUM"                      
## [2,] "PUERTO RICO & VIRGIN ISLANDS RUM" "PUERTO RICO & VIRGIN ISLANDS RUM"
## [3,] "AMERICAN AMARETTO"                "STRAIGHT BOURBON WHISKIES"       
##      AMERICAN.ALCOHOL    AMERICAN.AMARETTO  
## [1,] "COFFEE LIQUEURS"   "BLENDED WHISKIES" 
## [2,] "CREAM LIQUEURS"    "IMPORTED SCHNAPPS"
## [3,] "AMERICAN AMARETTO" "SPICED RUM"       
##      AMERICAN.COCKTAILS                 AMERICAN.DRY.GINS        
## [1,] "FLAVORED RUM"                     "AMERICAN GRAPE BRANDIES"
## [2,] "80 PROOF VODKA"                   "IMPORTED GRAPE BRANDIES"
## [3,] "PUERTO RICO & VIRGIN ISLANDS RUM" "FLAVORED VODKA"         
##      AMERICAN.GRAPE.BRANDIES   AMERICAN.SLOE.GINS   
## [1,] "IMPORTED GRAPE BRANDIES" "BLENDED WHISKIES"   
## [2,] "AMERICAN DRY GINS"       "SPICED RUM"         
## [3,] "GRAPE SCHNAPPS"          "PEPPERMINT SCHNAPPS"
##      ANISETTE                             APPLE.SCHNAPPS     
## [1,] "STRAIGHT BOURBON WHISKIES"          "PEACH SCHNAPPS"   
## [2,] "MISC. IMPORTED CORDIALS & LIQUEURS" "BLENDED WHISKIES" 
## [3,] "WHISKEY LIQUEUR"                    "IMPORTED SCHNAPPS"
##      APRICOT.BRANDIES      BARBADOS.RUM      BLACKBERRY.BRANDIES  
## [1,] "BLACKBERRY BRANDIES" "HIGH PROOF BEER" "APRICOT BRANDIES"   
## [2,] "PEPPERMINT SCHNAPPS" "WHISKEY LIQUEUR" "PEPPERMINT SCHNAPPS"
## [3,] "CANADIAN WHISKIES"   "JAMAICA RUM"     "CANADIAN WHISKIES"  
##      BLENDED.WHISKIES                   BOTTLED.IN.BOND.BOURBON  
## [1,] "IMPORTED SCHNAPPS"                "AMERICAN GRAPE BRANDIES"
## [2,] "80 PROOF VODKA"                   "IMPORTED GRAPE BRANDIES"
## [3,] "PUERTO RICO & VIRGIN ISLANDS RUM" "AMERICAN DRY GINS"      
##      BUTTERSCOTCH.SCHNAPPS CANADIAN.WHISKIES                 
## [1,] "CINNAMON SCHNAPPS"   "PUERTO RICO & VIRGIN ISLANDS RUM"
## [2,] "PEACH SCHNAPPS"      "80 PROOF VODKA"                  
## [3,] "APPLE SCHNAPPS"      "SPICED RUM"                      
##      CHERRY.BRANDIES             CINNAMON.SCHNAPPS      
## [1,] "CANADIAN WHISKIES"         "BUTTERSCOTCH SCHNAPPS"
## [2,] "STRAIGHT BOURBON WHISKIES" "PEPPERMINT SCHNAPPS"  
## [3,] "BLACKBERRY BRANDIES"       "SPICED RUM"           
##      COFFEE.LIQUEURS       CREAM.LIQUEURS        CREME.DE.ALMOND        
## [1,] "PEPPERMINT SCHNAPPS" "COFFEE LIQUEURS"     "GREEN CREME DE MENTHE"
## [2,] "AMERICAN AMARETTO"   "PEPPERMINT SCHNAPPS" "BLACKBERRY BRANDIES"  
## [3,] "IMPORTED SCHNAPPS"   "SPICED RUM"          "AMERICAN AMARETTO"    
##      DARK.CREME.DE.CACAO                  DECANTERS...SPECIALTY.PACKAGES
## [1,] "STRAIGHT BOURBON WHISKIES"          "MISCELLANEOUS BRANDIES"      
## [2,] "PEPPERMINT SCHNAPPS"                "CREAM LIQUEURS"              
## [3,] "MISC. IMPORTED CORDIALS & LIQUEURS" "AMERICAN ALCOHOL"            
##      DISTILLED.SPIRITS.SPECIALTY         
## [1,] "MISC. IMPORTED CORDIALS & LIQUEURS"
## [2,] "80 PROOF VODKA"                    
## [3,] "PUERTO RICO & VIRGIN ISLANDS RUM"  
##      FLAVORED.GIN                       FLAVORED.RUM                      
## [1,] "BLENDED WHISKIES"                 "PUERTO RICO & VIRGIN ISLANDS RUM"
## [2,] "IMPORTED SCHNAPPS"                "AMERICAN COCKTAILS"              
## [3,] "PUERTO RICO & VIRGIN ISLANDS RUM" "80 PROOF VODKA"                  
##      FLAVORED.VODKA                     GRAPE.SCHNAPPS           
## [1,] "80 PROOF VODKA"                   "WATERMELON SCHNAPPS"    
## [2,] "TEQUILA"                          "TRIPLE SEC"             
## [3,] "PUERTO RICO & VIRGIN ISLANDS RUM" "AMERICAN GRAPE BRANDIES"
##      GREEN.CREME.DE.MENTHE HIGH.PROOF.BEER                 
## [1,] "CREME DE ALMOND"     "SINGLE BARREL BOURBON WHISKIES"
## [2,] "BLACKBERRY BRANDIES" "WHITE CREME DE CACAO"          
## [3,] "AMERICAN AMARETTO"   "DISTILLED SPIRITS SPECIALTY"   
##      IMPORTED.AMARETTO           IMPORTED.DRY.GINS                   
## [1,] "IMPORTED SCHNAPPS"         "IMPORTED VODKA"                    
## [2,] "STRAIGHT BOURBON WHISKIES" "TEQUILA"                           
## [3,] "BLENDED WHISKIES"          "MISC. IMPORTED CORDIALS & LIQUEURS"
##      IMPORTED.GRAPE.BRANDIES   IMPORTED.SCHNAPPS 
## [1,] "AMERICAN GRAPE BRANDIES" "BLENDED WHISKIES"
## [2,] "IMPORTED VODKA - MISC"   "80 PROOF VODKA"  
## [3,] "AMERICAN DRY GINS"       "SPICED RUM"      
##      IMPORTED.VODKA                       IMPORTED.VODKA...MISC
## [1,] "MISC. IMPORTED CORDIALS & LIQUEURS" "TEQUILA"            
## [2,] "PUERTO RICO & VIRGIN ISLANDS RUM"   "TRIPLE SEC"         
## [3,] "TEQUILA"                            "IMPORTED VODKA"     
##      IRISH.WHISKIES      JAMAICA.RUM                         
## [1,] "AMERICAN DRY GINS" "IMPORTED SCHNAPPS"                 
## [2,] "BLENDED WHISKIES"  "MISC. IMPORTED CORDIALS & LIQUEURS"
## [3,] "IMPORTED SCHNAPPS" "BLENDED WHISKIES"                  
##      LOW.PROOF.VODKA       MISC..AMERICAN.CORDIALS...LIQUEURS
## [1,] "SCOTCH WHISKIES"     "STRAIGHT BOURBON WHISKIES"       
## [2,] "BLACKBERRY BRANDIES" "WHISKEY LIQUEUR"                 
## [3,] "CANADIAN WHISKIES"   "80 PROOF VODKA"                  
##      MISC..IMPORTED.CORDIALS...LIQUEURS
## [1,] "PUERTO RICO & VIRGIN ISLANDS RUM"
## [2,] "TEQUILA"                         
## [3,] "IMPORTED VODKA"                  
##      MISCELLANEOUS.BRANDIES              
## [1,] "MISC. AMERICAN CORDIALS & LIQUEURS"
## [2,] "WHITE CREME DE CACAO"              
## [3,] "BUTTERSCOTCH SCHNAPPS"             
##      MISCELLANEOUS.SCHNAPPS             OTHER.PROOF.VODKA      
## [1,] "BLENDED WHISKIES"                 "GREEN CREME DE MENTHE"
## [2,] "IMPORTED SCHNAPPS"                "IMPORTED SCHNAPPS"    
## [3,] "PUERTO RICO & VIRGIN ISLANDS RUM" "FLAVORED GIN"         
##      PEACH.BRANDIES                   PEACH.SCHNAPPS     
## [1,] "DECANTERS & SPECIALTY PACKAGES" "APPLE SCHNAPPS"   
## [2,] "BARBADOS RUM"                   "BLENDED WHISKIES" 
## [3,] "CREME DE ALMOND"                "IMPORTED SCHNAPPS"
##      PEPPERMINT.SCHNAPPS PUERTO.RICO...VIRGIN.ISLANDS.RUM    
## [1,] "SPICED RUM"        "80 PROOF VODKA"                    
## [2,] "AMERICAN AMARETTO" "BLENDED WHISKIES"                  
## [3,] "80 PROOF VODKA"    "MISC. IMPORTED CORDIALS & LIQUEURS"
##      RASPBERRY.SCHNAPPS  ROCK...RYE                  ROOT.BEER.SCHNAPPS 
## [1,] "IMPORTED SCHNAPPS" "STRAIGHT BOURBON WHISKIES" "IMPORTED SCHNAPPS"
## [2,] "BLENDED WHISKIES"  "80 PROOF VODKA"            "80 PROOF VODKA"   
## [3,] "AMERICAN AMARETTO" "PEPPERMINT SCHNAPPS"       "BLENDED WHISKIES" 
##      SCOTCH.WHISKIES                   
## [1,] "PUERTO RICO & VIRGIN ISLANDS RUM"
## [2,] "80 PROOF VODKA"                  
## [3,] "STRAIGHT BOURBON WHISKIES"       
##      SINGLE.BARREL.BOURBON.WHISKIES      
## [1,] "HIGH PROOF BEER"                   
## [2,] "IMPORTED VODKA - MISC"             
## [3,] "MISC. IMPORTED CORDIALS & LIQUEURS"
##      SINGLE.MALT.SCOTCH                  
## [1,] "BLENDED WHISKIES"                  
## [2,] "MISC. IMPORTED CORDIALS & LIQUEURS"
## [3,] "IMPORTED SCHNAPPS"                 
##      SPEARMINT.SCHNAPPS                  
## [1,] "STRAWBERRY SCHNAPPS"               
## [2,] "MISC. IMPORTED CORDIALS & LIQUEURS"
## [3,] "IRISH WHISKIES"                    
##      SPICED.RUM                         STRAIGHT.BOURBON.WHISKIES         
## [1,] "80 PROOF VODKA"                   "80 PROOF VODKA"                  
## [2,] "BLENDED WHISKIES"                 "BLENDED WHISKIES"                
## [3,] "PUERTO RICO & VIRGIN ISLANDS RUM" "PUERTO RICO & VIRGIN ISLANDS RUM"
##      STRAIGHT.RYE.WHISKIES STRAWBERRY.SCHNAPPS                 
## [1,] "APPLE SCHNAPPS"      "TEQUILA"                           
## [2,] "BLENDED WHISKIES"    "MISC. IMPORTED CORDIALS & LIQUEURS"
## [3,] "80 PROOF VODKA"      "TRIPLE SEC"                        
##      TENNESSEE.WHISKIES TEQUILA                             
## [1,] "WHISKEY LIQUEUR"  "MISC. IMPORTED CORDIALS & LIQUEURS"
## [2,] "80 PROOF VODKA"   "TRIPLE SEC"                        
## [3,] "SPICED RUM"       "IMPORTED VODKA - MISC"             
##      TRIPLE.SEC                           TROPICAL.FRUIT.SCHNAPPS
## [1,] "TEQUILA"                            "FLAVORED RUM"         
## [2,] "IMPORTED VODKA - MISC"              "AMERICAN COCKTAILS"   
## [3,] "MISC. IMPORTED CORDIALS & LIQUEURS" "AMERICAN AMARETTO"    
##      WATERMELON.SCHNAPPS WHISKEY.LIQUEUR            
## [1,] "GRAPE SCHNAPPS"    "80 PROOF VODKA"           
## [2,] "PEACH SCHNAPPS"    "IMPORTED SCHNAPPS"        
## [3,] "APPLE SCHNAPPS"    "STRAIGHT BOURBON WHISKIES"
##      WHITE.CREME.DE.CACAO        WHITE.CREME.DE.MENTHE  
## [1,] "HIGH PROOF BEER"           "WHITE CREME DE CACAO" 
## [2,] "STRAIGHT BOURBON WHISKIES" "GREEN CREME DE MENTHE"
## [3,] "WHISKEY LIQUEUR"           "AMERICAN AMARETTO"

Looking at the results, I was again happy to see that the highest correlated liquor with Triple sec is Tequila (0.86) - Margarita cocktails are very popular in Iowa also it seems! The correlation of Triple sec with “Imported Vodka”" was notably high as well (0.82).

Looking at the top correlations for Tequila, we see cocktail mixers (cordials/liqueurs and triple sec) and the third one is also imported vodka.

A quick look at the negative correlations now.

#Maximum negative correlations per categor
sapply(corr.cat.ts.min, function(x) names(x))
##      X100.PROOF.VODKA                 X80.PROOF.VODKA                 
## [1,] "HIGH PROOF BEER"                "HIGH PROOF BEER"               
## [2,] "PEACH BRANDIES"                 "PEACH BRANDIES"                
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
##      AMERICAN.ALCOHOL     AMERICAN.AMARETTO               
## [1,] "PEACH BRANDIES"     "HIGH PROOF BEER"               
## [2,] "BARBADOS RUM"       "PEACH BRANDIES"                
## [3,] "SPEARMINT SCHNAPPS" "DECANTERS & SPECIALTY PACKAGES"
##      AMERICAN.COCKTAILS               AMERICAN.DRY.GINS               
## [1,] "PEACH BRANDIES"                 "PEACH BRANDIES"                
## [2,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
## [3,] "CREME DE ALMOND"                "CREME DE ALMOND"               
##      AMERICAN.GRAPE.BRANDIES          AMERICAN.SLOE.GINS              
## [1,] "DECANTERS & SPECIALTY PACKAGES" "HIGH PROOF BEER"               
## [2,] "PEACH BRANDIES"                 "PEACH BRANDIES"                
## [3,] "HIGH PROOF BEER"                "DECANTERS & SPECIALTY PACKAGES"
##      ANISETTE          APPLE.SCHNAPPS                  
## [1,] "HIGH PROOF BEER" "HIGH PROOF BEER"               
## [2,] "PEACH BRANDIES"  "PEACH BRANDIES"                
## [3,] "CREME DE ALMOND" "DECANTERS & SPECIALTY PACKAGES"
##      APRICOT.BRANDIES                 BARBADOS.RUM         
## [1,] "HIGH PROOF BEER"                "CREME DE ALMOND"    
## [2,] "PEACH BRANDIES"                 "DARK CREME DE CACAO"
## [3,] "DECANTERS & SPECIALTY PACKAGES" "BLACKBERRY BRANDIES"
##      BLACKBERRY.BRANDIES              BLENDED.WHISKIES                
## [1,] "HIGH PROOF BEER"                "HIGH PROOF BEER"               
## [2,] "PEACH BRANDIES"                 "PEACH BRANDIES"                
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
##      BOTTLED.IN.BOND.BOURBON          BUTTERSCOTCH.SCHNAPPS           
## [1,] "CREME DE ALMOND"                "HIGH PROOF BEER"               
## [2,] "PEACH BRANDIES"                 "PEACH BRANDIES"                
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
##      CANADIAN.WHISKIES                CHERRY.BRANDIES                 
## [1,] "HIGH PROOF BEER"                "HIGH PROOF BEER"               
## [2,] "PEACH BRANDIES"                 "PEACH BRANDIES"                
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
##      CINNAMON.SCHNAPPS                COFFEE.LIQUEURS                 
## [1,] "HIGH PROOF BEER"                "HIGH PROOF BEER"               
## [2,] "PEACH BRANDIES"                 "PEACH BRANDIES"                
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
##      CREAM.LIQUEURS    CREME.DE.ALMOND                 
## [1,] "HIGH PROOF BEER" "SPEARMINT SCHNAPPS"            
## [2,] "PEACH BRANDIES"  "DECANTERS & SPECIALTY PACKAGES"
## [3,] "BARBADOS RUM"    "BOTTLED IN BOND BOURBON"       
##      DARK.CREME.DE.CACAO              DECANTERS...SPECIALTY.PACKAGES
## [1,] "HIGH PROOF BEER"                "DISTILLED SPIRITS SPECIALTY" 
## [2,] "PEACH BRANDIES"                 "AMERICAN COCKTAILS"          
## [3,] "DECANTERS & SPECIALTY PACKAGES" "CREME DE ALMOND"             
##      DISTILLED.SPIRITS.SPECIALTY      FLAVORED.GIN                    
## [1,] "DECANTERS & SPECIALTY PACKAGES" "PEACH BRANDIES"                
## [2,] "PEACH BRANDIES"                 "HIGH PROOF BEER"               
## [3,] "BARBADOS RUM"                   "DECANTERS & SPECIALTY PACKAGES"
##      FLAVORED.RUM                     FLAVORED.VODKA                  
## [1,] "HIGH PROOF BEER"                "PEACH BRANDIES"                
## [2,] "PEACH BRANDIES"                 "HIGH PROOF BEER"               
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
##      GRAPE.SCHNAPPS                   GREEN.CREME.DE.MENTHE           
## [1,] "PEACH BRANDIES"                 "DECANTERS & SPECIALTY PACKAGES"
## [2,] "DECANTERS & SPECIALTY PACKAGES" "PEACH BRANDIES"                
## [3,] "APRICOT BRANDIES"               "BARBADOS RUM"                  
##      HIGH.PROOF.BEER         IMPORTED.AMARETTO               
## [1,] "BUTTERSCOTCH SCHNAPPS" "HIGH PROOF BEER"               
## [2,] "CREAM LIQUEURS"        "PEACH BRANDIES"                
## [3,] "CINNAMON SCHNAPPS"     "DECANTERS & SPECIALTY PACKAGES"
##      IMPORTED.DRY.GINS IMPORTED.GRAPE.BRANDIES         
## [1,] "HIGH PROOF BEER" "PEACH BRANDIES"                
## [2,] "PEACH BRANDIES"  "DECANTERS & SPECIALTY PACKAGES"
## [3,] "CREME DE ALMOND" "CREME DE ALMOND"               
##      IMPORTED.SCHNAPPS                IMPORTED.VODKA   
## [1,] "HIGH PROOF BEER"                "HIGH PROOF BEER"
## [2,] "PEACH BRANDIES"                 "PEACH BRANDIES" 
## [3,] "DECANTERS & SPECIALTY PACKAGES" "CREME DE ALMOND"
##      IMPORTED.VODKA...MISC IRISH.WHISKIES                  
## [1,] "PEACH BRANDIES"      "HIGH PROOF BEER"               
## [2,] "CREME DE ALMOND"     "PEACH BRANDIES"                
## [3,] "HIGH PROOF BEER"     "DECANTERS & SPECIALTY PACKAGES"
##      JAMAICA.RUM                      LOW.PROOF.VODKA                 
## [1,] "HIGH PROOF BEER"                "PEACH BRANDIES"                
## [2,] "PEACH BRANDIES"                 "HIGH PROOF BEER"               
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
##      MISC..AMERICAN.CORDIALS...LIQUEURS MISC..IMPORTED.CORDIALS...LIQUEURS
## [1,] "HIGH PROOF BEER"                  "HIGH PROOF BEER"                 
## [2,] "PEACH BRANDIES"                   "PEACH BRANDIES"                  
## [3,] "CREME DE ALMOND"                  "DECANTERS & SPECIALTY PACKAGES"  
##      MISCELLANEOUS.BRANDIES MISCELLANEOUS.SCHNAPPS          
## [1,] "CREME DE ALMOND"      "HIGH PROOF BEER"               
## [2,] "IMPORTED DRY GINS"    "PEACH BRANDIES"                
## [3,] "ROOT BEER SCHNAPPS"   "DECANTERS & SPECIALTY PACKAGES"
##      OTHER.PROOF.VODKA                PEACH.BRANDIES     
## [1,] "PEACH BRANDIES"                 "IMPORTED SCHNAPPS"
## [2,] "DECANTERS & SPECIALTY PACKAGES" "SCOTCH WHISKIES"  
## [3,] "BARBADOS RUM"                   "AMERICAN AMARETTO"
##      PEACH.SCHNAPPS                   PEPPERMINT.SCHNAPPS             
## [1,] "HIGH PROOF BEER"                "HIGH PROOF BEER"               
## [2,] "PEACH BRANDIES"                 "PEACH BRANDIES"                
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
##      PUERTO.RICO...VIRGIN.ISLANDS.RUM RASPBERRY.SCHNAPPS              
## [1,] "HIGH PROOF BEER"                "HIGH PROOF BEER"               
## [2,] "PEACH BRANDIES"                 "PEACH BRANDIES"                
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
##      ROCK...RYE                       ROOT.BEER.SCHNAPPS              
## [1,] "PEACH BRANDIES"                 "PEACH BRANDIES"                
## [2,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
## [3,] "BARBADOS RUM"                   "MISCELLANEOUS BRANDIES"        
##      SCOTCH.WHISKIES                  SINGLE.BARREL.BOURBON.WHISKIES
## [1,] "HIGH PROOF BEER"                "PEACH BRANDIES"              
## [2,] "PEACH BRANDIES"                 "CREME DE ALMOND"             
## [3,] "DECANTERS & SPECIALTY PACKAGES" "LOW PROOF VODKA"             
##      SINGLE.MALT.SCOTCH SPEARMINT.SCHNAPPS              
## [1,] "HIGH PROOF BEER"  "HIGH PROOF BEER"               
## [2,] "PEACH BRANDIES"   "CREME DE ALMOND"               
## [3,] "CREME DE ALMOND"  "DECANTERS & SPECIALTY PACKAGES"
##      SPICED.RUM                       STRAIGHT.BOURBON.WHISKIES       
## [1,] "HIGH PROOF BEER"                "HIGH PROOF BEER"               
## [2,] "PEACH BRANDIES"                 "PEACH BRANDIES"                
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
##      STRAIGHT.RYE.WHISKIES            STRAWBERRY.SCHNAPPS             
## [1,] "PEACH BRANDIES"                 "HIGH PROOF BEER"               
## [2,] "DECANTERS & SPECIALTY PACKAGES" "PEACH BRANDIES"                
## [3,] "BARBADOS RUM"                   "DECANTERS & SPECIALTY PACKAGES"
##      TENNESSEE.WHISKIES               TEQUILA                         
## [1,] "PEACH BRANDIES"                 "HIGH PROOF BEER"               
## [2,] "DECANTERS & SPECIALTY PACKAGES" "PEACH BRANDIES"                
## [3,] "HIGH PROOF BEER"                "DECANTERS & SPECIALTY PACKAGES"
##      TRIPLE.SEC                       TROPICAL.FRUIT.SCHNAPPS         
## [1,] "HIGH PROOF BEER"                "HIGH PROOF BEER"               
## [2,] "PEACH BRANDIES"                 "PEACH BRANDIES"                
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
##      WATERMELON.SCHNAPPS              WHISKEY.LIQUEUR                 
## [1,] "HIGH PROOF BEER"                "HIGH PROOF BEER"               
## [2,] "PEACH BRANDIES"                 "PEACH BRANDIES"                
## [3,] "DECANTERS & SPECIALTY PACKAGES" "DECANTERS & SPECIALTY PACKAGES"
##      WHITE.CREME.DE.CACAO             WHITE.CREME.DE.MENTHE           
## [1,] "DECANTERS & SPECIALTY PACKAGES" "PEACH BRANDIES"                
## [2,] "PEACH BRANDIES"                 "DECANTERS & SPECIALTY PACKAGES"
## [3,] "CREME DE ALMOND"                "BARBADOS RUM"

Unfortunately liquors which are outliers in terms of the very low quantities ordered, in generate correlate highly negatively with the other categories.

Looking at Tequila’s top negative correlations we see high proof beer, peach brandies and decanters & specialty packages.

It would be useful to plot tequila against its highest positive and negative correlations and see what we get

#Produce plots

test1 <- ggplot(data=gr_date_cat.spread[,c(64,44)],aes(x=`MISC. IMPORTED CORDIALS & LIQUEURS`, y=TEQUILA)) +
  geom_point() +
  geom_smooth(method=lm) +  
  geom_text(x = 1500, y = 10000, 
            label = lm_eqn(lm(TEQUILA~`MISC. IMPORTED CORDIALS & LIQUEURS`,gr_date_cat.spread[c(64,44)])),
            parse = TRUE, size = 3)

test2 <- ggplot(data=gr_date_cat.spread[,c(64,65)],aes(x=`TRIPLE SEC`, y=TEQUILA)) +
  geom_point() +
  geom_smooth(method=lm) +  
  geom_text(x = 1500, y = 10000, 
                label = lm_eqn(lm(TEQUILA~`TRIPLE SEC`,gr_date_cat.spread[c(64,65)])),
                parse = TRUE, size = 3)

test3 <- ggplot(data=gr_date_cat.spread[,c(64,39)],aes(x=`IMPORTED VODKA - MISC`, y=TEQUILA)) +
  geom_point() +
  geom_smooth(method=lm) +  
  geom_text(x = 1500, y = 10000, 
            label = lm_eqn(lm(TEQUILA~`IMPORTED VODKA - MISC`,gr_date_cat.spread[c(64,39)])),
            parse = TRUE, size = 3)


test4 <- ggplot(data=gr_date_cat.spread[,c(64,33)],aes(x=`HIGH PROOF BEER`, y=TEQUILA)) +
  geom_point() +
  geom_smooth(method=lm, colour="red") +  
  geom_text(x = 5, y = 15000, 
            label = lm_eqn(lm(TEQUILA~`HIGH PROOF BEER`,gr_date_cat.spread[c(64,33)])),
            parse = TRUE, size = 3)

test5 <- ggplot(data=gr_date_cat.spread[,c(64,48)],aes(x=`PEACH BRANDIES`, y=TEQUILA)) +
  geom_point() +
  geom_smooth(method=lm, colour="red") +  
  geom_text(x = 500, y = 10000, 
            label = lm_eqn(lm(TEQUILA~`PEACH BRANDIES`,gr_date_cat.spread[c(64,48)])),
            parse = TRUE, size = 3)

test6 <- ggplot(data=gr_date_cat.spread[,c(64,26)],aes(x=`DECANTERS & SPECIALTY PACKAGES`, y=TEQUILA)) +
  geom_point() +
  geom_smooth(method=lm, colour="red") +  
  geom_text(x = 5000, y = 12000, 
            label = lm_eqn(lm(TEQUILA~`DECANTERS & SPECIALTY PACKAGES`,gr_date_cat.spread[c(64,26)])),
            parse = TRUE, size = 3)

multiplot(test1, test2, test3, test4, test5, test6, cols=2)

Here we see that for the highest positive correlations there is indeed a clear linear trend. In the contrary, the highest negative correlations do not really display a strong linear trend.

Out of curiosity, let’s produce the relevant displays for 80 proof vodka.

test1 <- ggplot(data=gr_date_cat.spread[,c(4,59)],aes(x=`SPICED RUM`, y=`80 PROOF VODKA`)) +
  geom_point() +
  geom_smooth(method=lm) +  
  geom_text(x = 1500, y = 40000, 
            label = lm_eqn(lm(`80 PROOF VODKA`~`SPICED RUM`,gr_date_cat.spread[c(4,59)])),
            parse = TRUE, size = 3)

test2 <- ggplot(data=gr_date_cat.spread[,c(4,51)],aes(x=`PUERTO RICO & VIRGIN ISLANDS RUM`, y=`80 PROOF VODKA`)) +
  geom_point() +
  geom_smooth(method=lm) +  
  geom_text(x = 1500, y = 40000, 
                label = lm_eqn(lm(`80 PROOF VODKA`~`PUERTO RICO & VIRGIN ISLANDS RUM`,gr_date_cat.spread[c(4,51)])),
                parse = TRUE, size = 3)

test3 <- ggplot(data=gr_date_cat.spread[,c(4,60)],aes(x=`STRAIGHT BOURBON WHISKIES`, y=`80 PROOF VODKA`)) +
  geom_point() +
  geom_smooth(method=lm) +  
  geom_text(x = 1500, y = 40000, 
            label = lm_eqn(lm(`80 PROOF VODKA`~`STRAIGHT BOURBON WHISKIES`,gr_date_cat.spread[c(4,60)])),
            parse = TRUE, size = 3)


test4 <- ggplot(data=gr_date_cat.spread[,c(4,33)],aes(x=`HIGH PROOF BEER`, y=`80 PROOF VODKA`)) +
  geom_point() +
  geom_smooth(method=lm, colour="red") +  
  geom_text(x = 5, y = 15000, 
            label = lm_eqn(lm(`80 PROOF VODKA`~`HIGH PROOF BEER`,gr_date_cat.spread[c(4,33)])),
            parse = TRUE, size = 3)

test5 <- ggplot(data=gr_date_cat.spread[,c(4,48)],aes(x=`PEACH BRANDIES`, y=`80 PROOF VODKA`)) +
  geom_point() +
  geom_smooth(method=lm, colour="red") +  
  geom_text(x = 500, y = 10000, 
            label = lm_eqn(lm(`80 PROOF VODKA`~`PEACH BRANDIES`,gr_date_cat.spread[c(4,48)])),
            parse = TRUE, size = 3)

test6 <- ggplot(data=gr_date_cat.spread[,c(4,26)],aes(x=`DECANTERS & SPECIALTY PACKAGES`, y=`80 PROOF VODKA`)) +
  geom_point() +
  geom_smooth(method=lm, colour="red") +  
  geom_text(x = 5000, y = 12000, 
            label = lm_eqn(lm(`80 PROOF VODKA`~`DECANTERS & SPECIALTY PACKAGES`,gr_date_cat.spread[c(4,26)])),
            parse = TRUE, size = 3)

multiplot(test1, test2, test3, test4, test5, test6, cols=2)

Spiced rum, Puerto Rico rum and straight bourbon whiskies are the highest positive correlations here and one can see a clear linear trend with very high R-squared values. Coincidentally (?) the same three categories are the highest negative correlations. Whilst high proof beer is most definitely an outlier in terms of either data recording or actual frequency of orders, peach brandies and decanters appear probably uncorrelated altogether. To be fair there does appear to be some negative association for peach brandies, but if one looks at the highest negative correlations for all categories, these two appear extremely frequently - so they are genuine outliers in terms of order patterns.


Final Plots and Summary

Plot One

Description One

The plot above shows an overlay of two time-series demonstrating measures of bottle quantities ordered. Specifically:

  • The raw time-series of bottle quantities ordered
  • The adjusted time-series, where quantities are divided by zipcode population (per order)

It is clear that when the quantities were population-adjusted, a clear periodicaal pattern is observable. Quantities of bottles ordered per citizen always peak on Tuesdays (annotated by the overlay of dashed lines). Monday and Wednesday are pretty much the same; after Wednesday the quantities decline. This pattern does makes sense: Monday is the first day of the week and the stores calculate their weekly requirements by the end of the day. They then submit the order either at end of business or early Tuesday; this allows time for the bottles to be delivered prior to the weekend. They shouldn’t need to order too many bottles after the early week’s budgeting.

Plot Two

Description Two

This time-series of boxplots shows the monthly number of orders submitted by liquor stores; the colour of the boxplots shows the number of participating stores on that particular month.

It is evident that some data-recording issue takes place in June (perhaps relevant to fiscal years?), as orders reduce dramatically and do not recover later in the year.

This plot shows that it is best to separate the dataset into 2 periods (Jan-May, June-Dec) for a more robust time-series analysis.

Plot Three

Description Three

This composite plot shows time series of bottle quantities ordered for all liquor categories in the data; also here the seasons are overlaid in order to aid the visual interpretation of the individual plots. There are many investigations one can partake in with these data, but at an elementary level the liquors can be categorised to “winter” and “summer” drinks.

We can identify that certain liquor types have a demand curve that only peaks in December, for example liqueurs with various flavours like crème de menthe/cacao/etc., brandies and some types of schnapps (cinnamon, butterscotch). Given that these types of liquors have a heavier taste and are generally considered “warming”, their demand curve makes sense - we have identified the “winter drinks”.

Another clear category of demand patterns is observable for a different set of liquors (tequila, triple sec, flavoured rums, imported vodka), where bottle quantities are positively correlated with expected average temperatures (demand lowest in winter, climbs gradually in spring, peaks between mid-spring/mid-summer, gradually descends in autumn). As opposed to the previous category, these are the “summer drinks”.


Reflection

Due to time limitations this project as it was submitted is not thorough, in the sense that most insights are derived from superficial analyses. Most of my observations are intuitive, and I have not succeeded in revealing any hidden patterns within the time-frame (perhaps with the exception of the clear periodicity of the total population-adjusted bottle quantity orders) - I fell into many dead-end analysis attempts.

In general, I think that identifying the auxilliary datasets and merging them to the master took a longer time than I expected and this limited the amount of time I could dedicate to the actual analysis. However, I think that combining the Iowa Liquor Sales data with the auxiliaries has provided a large, granular and rich data set that enables anyone interested to play with the data and see what they find - I am sharing it on Dropbox.

This project gave me the opportunity to hone my skills in R data wrangling (packages data.table, dplyr are awesome) and data visualisation (ggplot2); not a lot was done in terms of statistical modelling though. I was also happy I got to use the stuff I learnt in a previous Udacity course in scraping tables and request data using API’s. Additionally, I started to use git and GitHub - the integration in RStudio is great.

A little disclaimer about the project.Rdata file I’ve added to the repo; I’ve dropped quite a few columns in order to reduce file-size and conform with GitHub rules - it allows faster knitting of the .rmd that would otherwise take ages. (For the full enriched dataset please see Dropbox link).